Abstract

The data-processing inequality, that is, $I(U;Y) \le I(U;X)$ for a Markov chain $U \to X \to Y$,
has been the method of choice for proving impossibility (converse) results in information
theory and many other disciplines. Various channel-dependent improvements (called strong 
data-processing inequalities, or SDPIs) of this inequality have been proposed both classically and 
more recently. In this note we first survey known results relating various notions of contraction 
for a single channel. Then we consider the basic extension: given SDPI for each constituent 
channel in a Bayesian network, how to produce an end-to-end SDPI? 

Our approach is based on the (extract of the) Evans-Schulman method, which is demonstrated
for three different kinds of SDPIs, namely, the usual Ahlswede-Gács type contraction
coefficients (mutual information), Dobrushin’s contraction coefficients (total variation), and
finally the $F_I$-curve (the best possible non-linear SDPI for a given channel). Resulting bounds
on the contraction coefficients are interpreted as probability of site percolation. As an example, 
we demonstrate how to obtain SDPI for an n-letter memoryless channel with feedback given an 
SDPI for n = 1. 

Finally, we discuss a simple observation on the equivalence of a linear SDPI and comparison 
to an erasure channel (in the sense of “less noisy” order). This leads to a simple proof of a 
curious inequality of Samorodnitsky (2015), and sheds light on how information spreads in the 
subsets of inputs of a memoryless channel. 
1 Introduction


Multiplication of a componentwise non-negative vector by a stochastic matrix results in a vector that 
is “more uniform”. This observation appears in several classical works [Mar06,Doe37,Bir57] differing
in their particular way of making quantitative estimates. For example, Birkhoff’s work [Bir57]
initiated a study (sometimes known as geometric ergodicity) of contraction of the projective
distance $d_P(x, y) = \log\max_i \frac{x_i}{y_i} - \log\min_i \frac{x_i}{y_i}$ between vectors in $\mathbb{R}^n_+$. Here, instead, we will be interested
in contraction of statistical distances and information measures involving probability distributions, 
which we define next. 

Fix a transition probability kernel (channel) $P_{Y|X}: \mathcal{X} \to \mathcal{Y}$ acting between two measurable
spaces. We denote by $P_{Y|X} \circ P$ the distribution on $\mathcal{Y}$ induced by the push-forward of the distribution
$P$, which is the distribution of the output $Y$ when the input $X$ is distributed according to $P$, and
by $P \times P_{Y|X}$ the joint distribution $P_{XY}$ if $P_X = P$. We also denote by $P_{Z|Y} \circ P_{Y|X}$ the serial
composition of channels.^1

We define three quantities that will play a key role in our discussion: the total variation, the
Kullback-Leibler (KL) divergence and the mutual information


$$d_{TV}(P, Q) \triangleq \sup_E |P[E] - Q[E]| = \frac{1}{2} \int |dP - dQ|, \tag{1}$$

$$D(P\|Q) \triangleq \int \log \frac{dP}{dQ} \, dP, \tag{2}$$

$$I(A;B) \triangleq D(P_{AB} \| P_A P_B). \tag{3}$$


The purpose of this paper is to give exposition to the phenomenon that upon passing through 
a non-degenerate noisy channel distributions become strictly closer and this leads to a loss of 
information. Namely we have three effects: 

1. Total-variation (or Dobrushin) contraction:

$$d_{TV}(P_{Y|X} \circ P, P_{Y|X} \circ Q) < d_{TV}(P, Q).$$

2. Divergence contraction:

$$D(P_{Y|X} \circ P \| P_{Y|X} \circ Q) < D(P\|Q).$$

3. Information loss: For any Markov chain^2 $U \to X \to Y$ we have

$$I(U;Y) < I(U;X).$$

These strict inequalities are collectively referred to as strong data-processing inequalities (SDPIs). 
The goal of this paper is to show intricate interdependencies between these effects, as well as 
introducing tools for quantifying how strict these SDPIs are. 
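
As a quick numerical illustration of the first two effects, the following sketch pushes two distributions through a small channel and compares the distances before and after. The 3x3 matrix `W` and the inputs `p`, `q` are hypothetical values chosen only for illustration; they do not come from the paper.

```python
import math

def push_forward(p, W):
    """Output distribution P_{Y|X} o P for input p and row-stochastic matrix W."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def tv(p, q):
    """Total variation distance (1): half of the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence (2) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical channel (row x is P_{Y|X=x}) and hypothetical input distributions.
W = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]
p = [0.6, 0.3, 0.1]
q = [0.2, 0.3, 0.5]

tv_in, tv_out = tv(p, q), tv(push_forward(p, W), push_forward(q, W))
kl_in, kl_out = kl(p, q), kl(push_forward(p, W), push_forward(q, W))
print(f"d_TV: {tv_in:.4f} -> {tv_out:.4f}")
print(f"D:    {kl_in:.4f} -> {kl_out:.4f}")
```

Both quantities shrink strictly here because every entry of `W` is positive; a noiseless (permutation) channel would leave them unchanged.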

^1 More formally, we should have written $P_{Y|X}: \mathcal{P}(\mathcal{X}) \to \mathcal{P}(\mathcal{Y})$ as a map between spaces of probability measures
$\mathcal{P}(\cdot)$ on respective bases. The rationale for our notation $P_{Y|X}: \mathcal{X} \to \mathcal{Y}$ is that we view Markov kernels as randomized
functions. Then, a single distribution $P$ on $\mathcal{X}$ is a randomized function acting from a space of a single point, i.e.
$P: [1] \to \mathcal{X}$, and that in turn explains our notation $P_{Y|X} \circ P$ for denoting the induced marginal distribution.

^2 The notation $A \to B \to C$ simply means that $A \perp\!\!\!\perp C \mid B$.
Organization In Section 2 we overview the case of a single channel. Notably, most of the results
in the literature are proved for finite alphabets, i.e., $|\mathcal{X}||\mathcal{Y}| < \infty$, with a few exceptions such as
[CKZ98, PW16]. We provide in Appendix A a self-contained proof of some of these results for
general alphabets.

From then on we focus on the question: Given a multi-terminal network with a single source and
multiple sinks, and given SDPIs for each of the channels comprising the network, how do we obtain
an SDPI for the composite channel from source to sinks? It turns out that this question has been
addressed implicitly in the work of Evans and Schulman [ES99] on redundancy required in circuits
of noisy gates. Rudiments also appeared in Dawson [Daw75] as well as Boyen and Koller [BK98].

In Section 3 we present the essence of the Evans-Schulman method and derive upper bounds on
the mutual information contraction coefficient $\eta_{KL}$ for Bayesian networks (directed graphical
models). We also interpret the resulting bounds as probabilities of disrupting end-to-end connectivity
under independent removals of graph vertices (site percolation). Then in Section 4 we derive
analogous estimates for Dobrushin’s coefficient $\eta_{TV}$ that governs the contraction of the total variation on
networks. While the results exactly parallel those for mutual information, the proof relies on new
arguments using coupling. Finally, Section 5 extends the technique to bounding the $F_I$-curves (the
non-linear SDPIs). Section 6 concludes with an alternative point of view on mutual information
contraction, namely that of comparison to an erasure channel. As an example we give a short
proof of a result of Samorodnitsky [Sam15] about distribution of information in subsets of channel
outputs.

Notation Elements of the Cartesian product $\mathcal{X}^n$ are denoted $x^n = (x_1, \ldots, x_n)$ to emphasize their
dimension. Given a transition probability kernel $P_{Y|X}: \mathcal{X} \to \mathcal{Y}$ we denote by $P_{Y|X}^n$
the kernel acting from $\mathcal{X}^n \to \mathcal{Y}^n$ componentwise independently:

$$P_{Y^n|X^n}(y^n|x^n) = \prod_{j=1}^n P_{Y|X}(y_j|x_j).$$

To demonstrate the general bounds we consider the running example of $P_{Y|X}$ being an n-letter
binary symmetric channel (BSC), given by

$$Y = X + Z, \quad X, Y \in \mathbb{F}_2^n, \quad Z \sim \mathrm{Bern}(\delta)^n \tag{4}$$

and denoted by $\mathrm{BSC}(\delta)^n$. Throughout this paper $\bar\delta = 1 - \delta$.

2 SDPI for a single channel 

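2.1 Contraction coefficients for f-divergence and mutual information

Let $f: (0, \infty) \to \mathbb{R}$ be a convex function that is strictly convex at 1 and $f(1) = 0$. Let $D_f(P\|Q) \triangleq
\mathbb{E}_Q\left[f\left(\frac{dP}{dQ}\right)\right]$ denote the f-divergence of $P$ and $Q$ with $P \ll Q$, cf. [Csi67].^3 For example, the
total variation (1) and the KL divergence (2) correspond to $f(x) = \frac{1}{2}|x - 1|$ and $f(x) = x \log x$
respectively; taking $f(x) = (x - 1)^2$ we obtain the $\chi^2$-divergence: $\chi^2(P\|Q) \triangleq \int \left(\frac{dP}{dQ}\right)^2 dQ - 1$.

^3 More generally, $D_f(P\|Q) \triangleq \int f\left(\frac{dP/d\mu}{dQ/d\mu}\right) dQ$, where $\mu$ is a dominating probability measure of $P$ and $Q$, e.g.,
$\mu = (P + Q)/2$, with the understanding that $f(0) = f(0+)$, $0 f\left(\frac{0}{0}\right) = 0$ and $0 f\left(\frac{a}{0}\right) = \lim_{x \downarrow 0} x f\left(\frac{a}{x}\right)$ for $a > 0$.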
For any $Q$ that is not a point mass, define:

$$\eta_f(P_{Y|X}, Q) \triangleq \sup_{P:\, 0 < D_f(P\|Q) < \infty} \frac{D_f(P_{Y|X} \circ P \| P_{Y|X} \circ Q)}{D_f(P\|Q)}, \tag{5}$$

$$\eta_f(P_{Y|X}) \triangleq \sup_Q \eta_f(P_{Y|X}, Q). \tag{6}$$

It is easy to show that the supremum is over a non-empty set whenever $Q$ is not a point mass
(see Appendix A). For notational simplicity when the channel is clear from context we abbreviate
$\eta_f(P_{Y|X})$ as $\eta_f$. For contraction coefficients of total variation, $\chi^2$ and KL divergence, we write
$\eta_{TV}$, $\eta_{\chi^2}$ and $\eta_{KL}$, respectively, which play prominent roles in this exposition.

One of the main tools for studying ergodicity property of Markov chains as well as Gibbs
measures, $\eta_{TV}(P_{Y|X})$ is known as the Dobrushin’s coefficient of the kernel $P_{Y|X}$. Dobrushin [Dob56]
showed that the supremum in the definition of $\eta_{TV}$ can be restricted to point masses, namely,

$$\eta_{TV}(P_{Y|X}) = \sup_{x, x'} d_{TV}(P_{Y|X=x}, P_{Y|X=x'}), \tag{7}$$

thus providing a simple criterion for strong ergodicity of Markov processes. Later [CKZ98,
Proposition II.4.10(i)] (see also [CIR+93, Theorem 4.1] for finite alphabets) demonstrated that all other
contraction coefficients are upper bounded by the Dobrushin’s coefficient, with inequality being
typically strict (cf. the BSC example below):

Theorem 1 ([CKZ98, Proposition II.4.10]). For every f-divergence, we have

$$\eta_f(P_{Y|X}) \le \eta_{TV}(P_{Y|X}). \tag{8}$$
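
For a finite channel, the characterization (7) turns the Dobrushin coefficient into a maximum over pairs of rows. The sketch below evaluates it for a hypothetical 3x3 channel and checks, for one hypothetical pair of inputs, that the total-variation and KL contraction ratios respect Theorem 1. All numbers are illustrative assumptions, not values from the paper.

```python
import math
from itertools import combinations

def tv(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL divergence in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def push_forward(p, W):
    """Output distribution for input p through row-stochastic W."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def dobrushin(W):
    """eta_TV via (7): largest TV distance between two rows of W."""
    return max(tv(r, s) for r, s in combinations(W, 2))

# Hypothetical channel and inputs, chosen only for illustration.
W = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.2, 0.2, 0.6]]
p, q = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]

eta_tv = dobrushin(W)
tv_ratio = tv(push_forward(p, W), push_forward(q, W)) / tv(p, q)
kl_ratio = kl(push_forward(p, W), push_forward(q, W)) / kl(p, q)
print(eta_tv, tv_ratio, kl_ratio)
```

A single pair `(p, q)` only gives a lower bound on the supremum in (5); the point of (7) is that for total variation the supremum is already attained on point masses.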

For the opposite direction, lower bounds on $\eta_f$ typically involve $\eta_{\chi^2}$, the contraction coefficient
of the $\chi^2$-divergence. It is well-known, e.g. Sarmanov [Sar58], that $\eta_{\chi^2}(P_{Y|X}, P_X)$ is the squared
second largest eigenvalue of the conditional expectation operator, which in turn equals the maximal
correlation coefficient of the joint distribution $P_{XY}$:

$$S(X;Y) \triangleq \sup_{f,g} \rho(f(X), g(Y)) = \sqrt{\eta_{\chi^2}(P_{Y|X}, P_X)}, \tag{9}$$

where $\rho(\cdot, \cdot)$ denotes the correlation coefficient and the supremum is over real-valued functions $f, g$
such that $f(X)$ and $g(Y)$ are square integrable.

The relationship between $\eta_{KL}$ and $\eta_{\chi^2}$ on finite alphabets has been systematically studied by
Ahlswede and Gács [AG76]. In particular, [AG76] proved

$$\eta_{\chi^2}(P_{Y|X}, P_X) \le \eta_{KL}(P_{Y|X}, P_X), \tag{10}$$

and noticed that the inequality is frequently strict.^4 Furthermore, for finite alphabets, the following
equivalence is demonstrated in [AG76]:

$$\eta_{\chi^2}(P_{Y|X}, P_X) < 1 \iff \eta_{KL}(P_{Y|X}, P_X) < 1 \tag{11}$$

$$\iff \text{graph } \{(x, y) : P_X(x) > 0, P_{Y|X}(y|x) > 0\} \text{ is connected.} \tag{12}$$

As a criterion for $\eta_f(P_{Y|X}, P_X) < 1$, this is an improvement of (8) only for channels with $\eta_{TV}(P_{Y|X}) =
1$. The lower bound (10) can in fact be considerably generalized:

^4 See [AG76, Theorem 9] and [AGKN13] for examples.
Theorem 2. Let $f$ be twice continuously differentiable on $(0, \infty)$ with $f''(1) > 0$. Then for any
$P_X$ that is not a point mass,

$$\eta_{\chi^2}(P_{Y|X}, P_X) \le \eta_f(P_{Y|X}, P_X), \tag{13}$$

and

$$\eta_{\chi^2}(P_{Y|X}) \le \eta_f(P_{Y|X}). \tag{14}$$

See Appendix A.1 for a proof of (13) for the general case, which yields (14) by taking suprema
over $P_X$ on both sides. Note that (14) (resp. (13)) have been proved in [CKZ98, Proposition II.6.15]
for the general alphabet (resp. in [Rag14, Theorem 3.3] for finite alphabets).

Moreover, (14) in fact holds with equality for all nonlinear and operator convex $f$, e.g., for KL
divergence and for squared Hellinger distance; see [CRS94, Theorem 1] and [CKZ98, Proposition
II.6.13 and Corollary II.6.16]. Therefore, we have:

Theorem 3.

$$\eta_{\chi^2}(P_{Y|X}) = \eta_{KL}(P_{Y|X}). \tag{15}$$

See Appendix A.1 for a self-contained proof. This result was first obtained in [AG76] using
different methods for discrete space. Rather naturally, we also have [CKZ98, Proposition II.4.12]:

$$\eta_f(P_{Y|X}) = 1 \iff \eta_{TV}(P_{Y|X}) = 1$$

for any non-linear $f$.

As an illustrating example, for $\mathrm{BSC}(\delta)$ defined in (4), we have, cf. [AG76],

$$\eta_{\chi^2} = \eta_{KL} = (1 - 2\delta)^2 < \eta_{TV} = |1 - 2\delta|. \tag{16}$$
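
The values in (16) can be probed numerically. The sketch below, with a hypothetical crossover probability `delta = 0.1` and hypothetical binary inputs, checks that total variation contracts by exactly $|1 - 2\delta|$, while the KL ratio stays below $(1 - 2\delta)^2$ and approaches it for two inputs that are close to each other near the uniform distribution.

```python
import math

def bern_kl(p, q):
    """D(Bern(p) || Bern(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bsc(p, delta):
    """Bernoulli parameter of the BSC(delta) output when the input is Bern(p)."""
    return p * (1 - delta) + (1 - p) * delta

delta = 0.1            # hypothetical crossover probability
c = 1 - 2 * delta

# Total variation between Bernoulli laws is |p - q|, so the ratio is exactly |c|.
p, q = 0.9, 0.3
tv_ratio = abs(bsc(p, delta) - bsc(q, delta)) / abs(p - q)

# KL ratio for a far-apart pair and for a pair close to the uniform input
# (the uniform input is mapped to the uniform output by the BSC).
kl_ratio_far = bern_kl(bsc(p, delta), bsc(q, delta)) / bern_kl(p, q)
eps = 1e-3
kl_ratio_near = bern_kl(bsc(0.5 + eps, delta), 0.5) / bern_kl(0.5 + eps, 0.5)

print(tv_ratio, kl_ratio_far, kl_ratio_near, c ** 2)
```

The local behaviour near the uniform input is the $\chi^2$ regime, which is one way to see why $\eta_{\chi^2}$ and $\eta_{KL}$ coincide in (16).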


Appendix B presents general results on the contraction coefficients for binary-input arbitrary-output
channels, which can be bounded using Hellinger distance within a factor of two.

We next discuss the fixed-input contraction coefficient $\eta_{KL}(P_{Y|X}, Q)$. Unfortunately, there
is no simple reduction to the $\chi^2$-case as in (15). Besides the lower bound (10), there is a variety
of upper bounds relating $\eta_{KL}$ and $\eta_{\chi^2}$. We quote [MZ15, Theorem 11], who show for the finite
input-alphabet case:

$$\eta_{KL}(P_{Y|X}, Q) \le \frac{1}{\min_x Q(x)} \, \eta_{\chi^2}(P_{Y|X}, Q).$$

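Another bound (which also holds for all $\eta_f$ with operator-convex $f$) is in [Rag14, Theorem 3.6]:

$$\eta_{KL}(P_{Y|X}, Q) \le \max\left( \eta_{\chi^2}(P_{Y|X}, Q), \ \sup_{0 < \beta < 1} \eta_{LC_\beta}(P_{Y|X}, Q) \right),$$

where $\eta_{LC_\beta}$ denotes contraction coefficient of an f-divergence $LC_\beta(P\|Q) = \beta\bar\beta \int \frac{(dP - dQ)^2}{\beta \, dP + \bar\beta \, dQ}$ with $\beta \in
(0, 1)$ and $\bar\beta = 1 - \beta$ (see also Appendix B).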

We also note in passing that SDPIs are intimately related to hypercontractivity and maximal
correlation, as discovered by Ahlswede and Gács [AG76] and recently improved by Anantharam et
al. [AGKN13] and Nair [Nai14]. Indeed, the main result of [AG76] characterizes $\eta_{KL}(P_{Y|X}, P_X)$ as
the maximal ratio of hyper-contractivity of the conditional expectation operator $\mathbb{E}[\cdot|X]$.

The fixed-input contraction coefficient $\eta_{KL}(Q)$ is closely related to the (modified) log-Sobolev
inequalities. Indeed, if $\eta_{KL}(Q) < 1$ where $Q$ is the invariant measure for the Markov kernel $P_{Y|X}$, i.e.,
$P_{Y|X} \circ Q = Q$, then any initial distribution $P$ such that $D(P\|Q) < \infty$ converges to $Q$ exponentially
fast since

$$D(P_{Y|X}^n \circ P \| Q) \le \eta_{KL}^n(P_{Y|X}, Q) \, D(P\|Q),$$
where the exponent $\eta_{KL}(P_{Y|X}, Q)$ can in turn be estimated from log-Sobolev inequalities, e.g. [Led99].
When $Q$ is not invariant, it was shown [DMLM03] that

$$1 - \alpha(Q) \le \eta_{KL}(P_{Y|X}, Q) \le 1 - C\alpha(Q)$$

holds for some universal constant $C$, where $\alpha(Q)$ is a modified log-Sobolev (also known as
1-log-Sobolev) constant:

$$\alpha(Q) = \inf_{f \perp 1, \, \|f\|_2 = 1} \frac{\mathbb{E}\left[f^2(X) \log \frac{f^2(X)}{f^2(X')}\right]}{\mathbb{E}[f^2(X) \log f^2(X)]}, \qquad P_{XX'} = Q \times (P_{X|Y} \circ P_{Y|X}).$$

For further connections between $\eta_{KL}$ and log-Sobolev inequalities on finite alphabets see [Rag13,
Rag14].

There exist several other characterizations of $\eta_{KL}$, such as the following one in terms of the
contraction of mutual information (cf. [CK81, Exercise III.3.12, p. 350] for finite alphabet):

$$\eta_{KL}(P_{Y|X}) = \sup \frac{I(U;Y)}{I(U;X)}, \tag{17}$$

where the supremum is over all Markov chains $U \to X \to Y$ with fixed $P_{Y|X}$ (or equivalently, over
all joint distributions $P_{XU}$) such that $I(U;X) < \infty$. This result is an immediate consequence of
the following input-dependent version (see Appendix A.3 for a proof in the general case; the finite
alphabet case has been shown in [AGKN13])

Theorem 4. For any $P_X$ that is not a point mass,

$$\eta_{KL}(P_{Y|X}, P_X) = \sup \frac{I(U;Y)}{I(U;X)}, \tag{18}$$

where the supremum is taken over all Markov chains $U \to X \to Y$ with fixed $P_{XY} = P_X \times P_{Y|X}$
such that $0 < I(U;X) < \infty$.

Another characterization of $\eta_{KL}$, in view of (15) and (9), is

$$\eta_{KL}(P_{Y|X}) = \sup \rho^2(f(X), g(Y)),$$

where the supremum is over all $P_X$ and real-valued square-integrable $f(X)$ and $g(Y)$.


2.2 Non-linear SDPI 

How to quantify the information loss if $\eta_{KL} = 1$ for the channel of interest? In fact this situation
can arise in very basic settings, such as the additive-noise Gaussian channel under the moment
constraint on the input distributions (cf. [PW16, Theorem 9, Section 4.5]), where the mutual
information does not contract linearly as in (17), but can still contract non-linearly. In such cases,
establishing a strong data-processing inequality can be done by following the joint-range idea of
Harremoës and Vajda [HV11]. Namely, we aim to find (or bound) the best possible data-processing
function $F_I$ defined as follows.

Definition 1 ($F_I$-curve). Fix $P_{Y|X}$ and define

$$F_I(t, P_{Y|X}) \triangleq \sup_{P_{UX}} \{ I(U;Y) : I(U;X) \le t, \ P_{UXY} = P_{UX} P_{Y|X} \}. \tag{19}$$
Equivalently, the supremum is taken over all joint distributions $P_{UXY}$ with a given conditional
$P_{Y|X}$ satisfying $U \to X \to Y$. The upper concave envelope of $F_I$ is denoted by $F_I^c$:

$$F_I^c(t, P_{Y|X}) \triangleq \inf\{ f(t) : \forall t' \ge 0 \ F_I(t', P_{Y|X}) \le f(t'), \ f \text{ concave} \}.$$

Equivalently, we have

$$F_I^c(t, P_{Y|X}) = \sup_{P_{VUX}} \{ I(U;Y|V) : I(U;X|V) \le t, \ P_{VUXY} = P_{VUX} P_{Y|X} \}, \tag{20}$$

where $I(A;B|C) = I(A,C;B) - I(C;B)$ is the conditional mutual information, and averaging over
$V$ serves the role of concavification (so that $V$ can be taken binary). Whenever it does not lead to
confusion we will write $F_{Y|X}(t)$ instead of $F_I(t, P_{Y|X})$.

The operational significance of the $F_I$-curve is that it gives the optimal input-independent strong
data processing inequality:

$$I(U;Y) \le F_I(I(U;X)),$$

which generalizes (17) since $F_I'(0) = \eta_{KL}(P_{Y|X})$ and $t \mapsto \frac{1}{t} F_I(t)$ is decreasing (see, e.g., [CPW15,
Section I]). See [CPW15] for bounds and expressions for BSC and Gaussian channels.

Frequently it is more convenient to work with the concavified version $F_I^c$ as it allows for some
natural extension of the results about contraction coefficients. Proposition 18 shows that $F_I$ may
not be concave.


2.3 Some applications: classical and new 

The main example of a strong data-processing inequality (SDPI) was discovered by Ahlswede and
Gács [AG76]. They have shown, using the characterization (11), that whenever $P_{Y|X}$ is a discrete
memoryless channel that does not admit zero-error communication, we have $\eta_{KL}(P_{Y|X}) \triangleq \eta < 1$
and

$$I(W;Y) \le \eta I(W;X) \tag{21}$$

for all Markov chains $W \to X \to Y$.

SDPIs have been popular for establishing lower (impossibility) bounds in various setups, in both 
classical and more recent works. We mention only a few of these applications: 

• By Dobrushin for showing non-existence of multiple phases in Ising models at high
temperatures [Dob70];

• By Erkip and Cover in portfolio theory [EC98];

• By Evans and Schulman in analysis of noise-resistant circuits [ES99];

• By Evans, Kenyon, Peres and Schulman in the analysis of inference on trees and
percolation [EKPS00];

• By Courtade in distributed data-compression [Cou12];

• By Duchi, Wainwright and Jordan in statistical limitations of differential privacy [DJW13];

• By the authors to quantify optimal communication and optimal control in line networks
[PW16];

• By Liu, Cuff and Verdú in key generation [LCV15];

• By Xu and Raginsky in distributed estimation [XR15].

All of the applications above use SDPI (21) to prove negative (impossibility) statements. A
notable exception is the work of Boyen and Koller [BK98], who considered the basic problem of
computing the posterior-belief vector of a hidden Markov model: that is, given a Markov chain $\{X_j\}$
observed over a memoryless channel $P_{Y|X}$, one aims to recompute $P_{X_j | Y_1^j}$ as each new observation
$Y_j$ arrives. The problem arises when $X$ is of large dimension and then for practicality one is
constrained to approximate (quantize) the posterior. However, due to the recursive nature of belief
computations, the cumulative effect of these approximations may become overwhelming. Boyen
and Koller [BK98] proposed to use the SDPI similar to (21) with $\eta < 1$ for the Markov chain $\{X_j\}$
and show that this cumulative effect stays bounded since $\sum_n \eta^n < \infty$. Similar considerations also
enable one to provide provable guarantees for simulation of inter-dependent stochastic processes.

3 Contraction of mutual information in networks 

We start by defining a Bayesian network (also known as a directed graphical model). Let $G$ be a
finite directed acyclic graph with set of vertices $\{Y_v : v \in \mathcal{V}\}$ denoting random variables taking
values in a fixed finite alphabet.^5 We assume that each vertex $Y_v$ is associated with a conditional
distribution $P_{Y_v | Y_{\mathrm{pa}(v)}}$ where $\mathrm{pa}(v)$ denotes parents of $v$, with the exception of one special “source”
node $X$ that has no inbound edges (there may be other nodes without inbound edges, but those
have to have their marginals specified). Notice that if $V \subset \mathcal{V}$ is an arbitrary set of nodes we can
progressively chain together all the random transformations and unequivocally compute $P_{V|X}$ (here
and below we use $V$ and $Y_V = \{Y_v : v \in V\}$ interchangeably). We assume that vertices in $\mathcal{V}$ are
topologically sorted so that $v_1 > v_2$ implies there is no path from $v_1$ to $v_2$. Associated to each node
we also define

$$\eta_v \triangleq \eta_{KL}(P_{Y_v | Y_{\mathrm{pa}(v)}}).$$

See the excellent book of Lauritzen [Lau96] for a thorough introduction to a graphical model
language of specifying conditional independencies.

The following result can be distilled from [ES99]:

Theorem 5. Let $W \in \mathcal{V}$ and $V \subset \mathcal{V}$ such that $W > V$. Then

$$\eta_{KL}(P_{V,W|X}) \le \eta_W \cdot \eta_{KL}(P_{V,\mathrm{pa}(W)|X}) + (1 - \eta_W) \cdot \eta_{KL}(P_{V|X}). \tag{22}$$

Furthermore, let $\mathrm{perc}(V)$ denote the probability that there is a path from $X$ to $V$^6 in the graph if
each node $v$ is removed independently with probability $1 - \eta_v$ (site percolation). Then, we have for
every $V \subset \mathcal{V}$

$$\eta_{KL}(P_{V|X}) \le \mathrm{perc}(V). \tag{23}$$

In particular, if $\eta_v < 1$ for all $v$ then $\eta_{KL}(P_{V|X}) < 1$.

Proof. Consider an arbitrary random variable $U$ such that

$$U \to X \to (V, W).$$

^5 At the expense of technical details, the alphabet can be replaced with any countably-generated (e.g. Polish)
measurable space. For clarity of presentation we focus here on finite alphabets.

^6 More formally, $\mathrm{perc}(V)$ equals probability that there exists a sequence of nodes $v_1, \ldots, v_n$ with $v_1 = X$, $v_n \in V$
satisfying two conditions: 1) for each $i \in [n - 1]$ the pair $(v_i, v_{i+1})$ is a directed edge in $G$; and 2) each $v_i$ is not
removed.
Let $A = \mathrm{pa}(W) \setminus V$. Without loss of generality we may assume $A$ does not contain $X$: indeed, if
$A$ includes $X$ then we can introduce an artificial node $X'$ such that $X' = X$ and include $X'$ into $A$
instead of $X$. Relevant conditional independencies are encoded in the following graph (diagram only partially recovered):

    U → X → V

        A → W

From the characterization (17) it is sufficient to show

$$I(U; V, W) \le (1 - \eta_W) I(U; V) + \eta_W I(U; V, A). \tag{24}$$

Denote $B = V \setminus \mathrm{pa}(W)$ and $C = V \cap \mathrm{pa}(W)$. Then $\mathrm{pa}(W) = (A, C)$ and $V = (B, C)$. To verify (24)
notice that by assumption we have

$$U \to X \to (V, A).$$

Therefore conditioned on $V$ we have the Markov chain

$$U \to X \to A \to W \mid V$$

and the channel $A \to W$ is a restriction of the original $P_{W|\mathrm{pa}(W)}$ to a subset of the inputs. Indeed,
$P_{W|A,V} = P_{W|\mathrm{pa}(W),B} = P_{W|\mathrm{pa}(W)}$ by the assumption of the graphical model. Thus, for every
realization $v = (b, c)$ of $V$, we have $P_{W|A=a,V=v} = P_{W|A=a,C=c}$ and therefore

$$I(U; W | V = v) \le \eta(P_{W|A,C=c}) I(U; A | V = v) \le \eta(P_{W|A,C}) I(U; A | V = v), \tag{25}$$

where the last inequality uses the following property of the contraction coefficient which easily
follows from either (6) or (17):

$$\sup_c \eta(P_{W|A,C=c}) \le \eta(P_{W|A,C}). \tag{26}$$

Averaging both sides of (25) over $v \sim P_V$ and using the definition $\eta_W = \eta(P_{W|\mathrm{pa}(W)}) = \eta(P_{W|A,C})$,
we have

$$I(U; W | V) \le \eta_W I(U; A | V). \tag{27}$$

Adding $I(U; V)$ to both sides yields (24).

We now move to proving the percolation bound (23). First, notice that if a vertex $W$ satisfies
$W > V$, then letting $\{\exists \pi : X \to V\}$ be the event that there exists a directed path from $X$ to (any
element of) the set $V$ under the site percolation model, we notice that $\{W \text{ removed}\}$ is independent
from $\{\exists \pi : X \to V\}$ and $\{\exists \pi : X \to V \cup \mathrm{pa}(W)\}$. Thus we have

$$\begin{aligned}
\mathrm{perc}(V \cup \{W\}) &= \mathbb{P}[\exists \pi : X \to V \cup \{W\}] \\
&= \mathbb{P}[\exists \pi : X \to V \cup \{W\}, W \text{ removed}] + \mathbb{P}[\exists \pi : X \to V \cup \{W\}, W \text{ kept}] \\
&= \mathbb{P}[\exists \pi : X \to V, W \text{ removed}] + \mathbb{P}[\exists \pi : X \to V \cup \mathrm{pa}(W), W \text{ kept}] \\
&= \mathbb{P}[\exists \pi : X \to V](1 - \eta_W) + \eta_W \mathbb{P}[\exists \pi : X \to V \cup \mathrm{pa}(W)] \\
&= (1 - \eta_W) \mathrm{perc}(V) + \eta_W \mathrm{perc}(V \cup \mathrm{pa}(W)).
\end{aligned}$$

That is, the set-function $\mathrm{perc}(\cdot)$ satisfies the recursion given by the right-hand side of (22). Now
notice that (23) holds trivially for $V = \{X\}$, since both sides are equal to 1. Then, by induction
on the maximal element of $V$ and applying (22) we get that (23) holds for all $V$. □
Theorem 5 allows us to estimate contraction coefficients in arbitrary (finite) networks by peeling
off last nodes one by one. Next we derive a few corollaries: 
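
For small networks the percolation bound (23) can be evaluated by brute force. The sketch below enumerates all keep/remove patterns of the non-source nodes and sums the probability of those in which some target node is reachable from the source through kept nodes. The function name `perc`, the dictionary encoding of the graph and the value 0.3 are illustrative assumptions, not notation from the paper; the source node is treated as never removed.

```python
from itertools import product

def perc(parents, eta, source, targets):
    """Probability that a directed path of kept nodes joins `source` to `targets`,
    when each non-source node v is kept independently with probability eta[v]."""
    nodes = [v for v in parents if v != source]
    total = 0.0
    for flags in product([False, True], repeat=len(nodes)):
        kept = {source} | {v for v, keep in zip(nodes, flags) if keep}
        prob = 1.0
        for v, keep in zip(nodes, flags):
            prob *= eta[v] if keep else 1.0 - eta[v]
        # Grow the set of kept nodes reachable from the source.
        reach = {source}
        changed = True
        while changed:
            changed = False
            for v in kept:
                if v not in reach and any(u in reach for u in parents[v]):
                    reach.add(v)
                    changed = True
        if reach & set(targets):
            total += prob
    return total

# Two hypothetical toy networks with all coefficients equal to 0.3.
eta = {"Y1": 0.3, "Y2": 0.3}
chain = {"X": [], "Y1": ["X"], "Y2": ["Y1"]}      # X -> Y1 -> Y2
parallel = {"X": [], "Y1": ["X"], "Y2": ["X"]}    # X -> Y1, X -> Y2

perc_chain = perc(chain, eta, "X", {"Y2"})              # eta^2
perc_parallel = perc(parallel, eta, "X", {"Y1", "Y2"})  # 2*eta - eta^2
print(perc_chain, perc_parallel)
```

The enumeration is exponential in the number of nodes, so it is only meant as a sanity check; the recursion on the right-hand side of (22) is the practical way to peel off nodes one by one.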


Corollary 6. Consider a fixed (single-letter) channel $P_{Y|X}$ and assume that it is used repeatedly
and with perfect feedback to send information from $W$ to $(Y_1, \ldots, Y_n)$. That is, we have for some
encoder functions $f_j$

$$P_{Y^n|W}(y^n|w) = \prod_{j=1}^n P_{Y|X}(y_j | f_j(w, y^{j-1})),$$

which corresponds to the graphical model:


Then

$$\eta_{KL}(P_{Y^n|W}) \le 1 - (1 - \eta_{KL}(P_{Y|X}))^n < n \cdot \eta_{KL}(P_{Y|X}).$$

Proof. Apply Theorem 5 n times. □
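
One way to read the proof: apply (22) with the last output as the new node and bound the first contraction coefficient on the right-hand side by 1, which gives $\eta_j \le \eta + (1 - \eta)\eta_{j-1}$ with $\eta_0 = 0$. The sketch below iterates this recursion and compares it with the closed form and with the union-type bound $n\eta$. The function name and the numerical values are hypothetical.

```python
def feedback_bound(eta, n):
    """Iterate eta_j <= eta + (1 - eta) * eta_{j-1}, starting from eta_0 = 0."""
    bound = 0.0
    for _ in range(n):
        bound = eta + (1.0 - eta) * bound
    return bound

eta, n = 0.3, 5   # hypothetical single-letter coefficient and blocklength
iterated = feedback_bound(eta, n)
closed_form = 1.0 - (1.0 - eta) ** n
print(iterated, closed_form, n * eta)
```

The recursion is equivalent to $1 - \eta_j \le (1 - \eta)(1 - \eta_{j-1})$, which is where the power $(1 - \eta)^n$ comes from.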

Let us call a path $\pi = (X, \cdots, v)$ with $v \in V$ to be shortcut-free from $X$ to $V$, denoted $X \xrightarrow{sf} V$,
if there does not exist another path $\pi'$ from $X$ to any node in $V$ such that $\pi'$ is a subset of $\pi$. (In
particular $v$ necessarily is the first node in $V$ that $\pi$ visits.) Also for every path $\pi = (X, v_1, \ldots, v_m)$
we define

$$\eta^\pi \triangleq \prod_{j=1}^m \eta_{v_j}.$$

Corollary 7. For any subset $V$ we have

$$\eta_{KL}(P_{V|X}) \le \sum_{\pi: X \xrightarrow{sf} V} \eta^\pi. \tag{28}$$

In particular, we have the estimate of Evans-Schulman [ES99]:

$$\eta_{KL}(P_{V|X}) \le \sum_{\pi: X \to V} \eta^\pi. \tag{29}$$

Proof. Both results are simple consequence of union-bounding the right-hand side of (23). But for
completeness, we give an explicit proof. First, notice the following two self-evident observations:

1. If $A$ and $B$ are disjoint sets of nodes, then

$$\sum_{\pi: X \xrightarrow{sf} A \cup B} \eta^\pi = \sum_{\pi: X \xrightarrow{sf} A, \text{ avoid } B} \eta^\pi + \sum_{\pi: X \xrightarrow{sf} B, \text{ avoid } A} \eta^\pi. \tag{30}$$

2. Let $\pi: X \to V$ and $\pi_1$ be $\pi$ without the last node, then

$$\pi: X \xrightarrow{sf} V \iff \pi_1: X \xrightarrow{sf} \{\mathrm{pa}(V) \setminus V\}. \tag{31}$$
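Now represent $V = (V', W)$ with $W > V'$, denote $P = \mathrm{pa}(W) \setminus V$ and assume (by induction)
that

$$\eta_{KL}(P_{V'|X}) \le \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi, \tag{32}$$

$$\eta_{KL}(P_{V',P|X}) \le \sum_{\pi: X \xrightarrow{sf} \{V', P\}} \eta^\pi. \tag{33}$$

By (30) and (31) we have

\begin{align}
\sum_{\pi: X \xrightarrow{sf} V} \eta^\pi &= \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi + \sum_{\pi: X \xrightarrow{sf} W, \text{ avoid } V'} \eta^\pi \tag{34} \\
&= \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi + \eta_W \sum_{\pi: X \xrightarrow{sf} P, \text{ avoid } V'} \eta^\pi. \tag{35}
\end{align}

Then by Theorem 5 and induction hypotheses (32)-(33) we get

\begin{align}
\eta_{KL}(P_{V|X}) &\le \eta_W \sum_{\pi: X \xrightarrow{sf} \{V', P\}} \eta^\pi + (1 - \eta_W) \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi \tag{36} \\
&= \eta_W \left( \sum_{\pi: X \xrightarrow{sf} P, \text{ avoid } V'} \eta^\pi - \sum_{\pi: X \xrightarrow{sf} V', \text{ pass } P} \eta^\pi \right) + \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi \tag{37} \\
&\le \eta_W \sum_{\pi: X \xrightarrow{sf} P, \text{ avoid } V'} \eta^\pi + \sum_{\pi: X \xrightarrow{sf} V'} \eta^\pi, \tag{38}
\end{align}

where in (37) we applied (30) and split the summation over $\pi: X \xrightarrow{sf} V'$ into paths that avoid and
pass nodes in $P$. Comparing (35) and (38) the conclusion follows. □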


Both estimates (28) and (29) are compared to that of Theorem 5 in Table 1 in various graphical 
models. 


Evaluation for the BSC We consider the contraction coefficient for the n-letter binary
symmetric channel $\mathrm{BSC}(\delta)^n$ defined in (4). By (16), for $n = 1$ we have $\eta_{KL} = (1 - 2\delta)^2$. Then by
Corollary 6 we have for arbitrary $n$:

$$\eta_{KL} \le 1 - (4\delta(1 - \delta))^n. \tag{39}$$

A simple lower bound for $\eta_{KL}$ can be obtained by considering (17) and taking $U \sim \mathrm{Bern}(1/2)$
and $U \to X$ being an n-letter repetition code, namely, $X = (U, \ldots, U)$. Let^7 $\epsilon = \mathbb{P}[|Z| \ge n/2]$ be
the probability of error for the maximal likelihood decoding of $U$ based on $Y$, which satisfies the
Chernoff bound $\epsilon \le (4\delta(1 - \delta))^{n/2}$. We have from Jensen’s inequality

$$I(U; Y) = H(U) - H(U|Y) \ge 1 - h(\epsilon) = 1 - (4\delta(1 - \delta))^{\frac{n}{2} + O(\log n)},$$

^7 For elements of $\mathbb{F}_2^n$, $|\cdot|$ is the Hamming weight.
| Name | Graph | Theorem 5 | Estimate (28) via shortcut-free paths | Original Evans-Schulman estimate (29) |
|---|---|---|---|---|
| Markov chain 1 | $X \to Y_1 \to B \to Y_2$ | $\eta$ | $\eta$ | $\eta + \eta^3$ |
| Markov chain 2 | (diagram lost in extraction; nodes $X$, $A$, $Y$) | $\eta^2$ | $\eta^2$ | $\eta^2 + \eta^3$ |
| Parallel channels | $X \to Y_1$, $X \to Y_2$ | $2\eta - \eta^2$ | $2\eta$ | $2\eta$ |
| Parallel channels with feedback | (diagram lost in extraction; nodes $Y_1$, $Y_2$) | $2\eta - \eta^2$ | $2\eta$ | $3\eta$ |

Table 1: Comparing bounds on the contraction coefficient $\eta_{KL}(P_{V|X})$. For simplicity, we assume
that the $\eta_{KL}$ coefficients of all constituent kernels are bounded from above by $\eta$.


where we used the fact that the binary entropy $h(x) = -x \log x - (1 - x) \log(1 - x) = -x \log x + O(x)$
as $x \to 0$. Consequently, we get

$$\eta_{KL} \ge 1 - (4\delta(1 - \delta))^{\frac{n}{2} + O(\log n)}. \tag{40}$$

Comparing (39) and (40) we see that $\eta_{KL} \to 1$ exponentially fast. To get the exact exponent we
need to replace (39) by the following improvement:

$$\eta_{KL} \le \eta_{TV} \le 1 - (4\delta(1 - \delta))^{\frac{n}{2} + O(\log n)},$$

where the first inequality is from (8) and the second is from (48) below. Thus, all in all we have
for $\mathrm{BSC}(\delta)^n$ as $n \to \infty$

$$\eta_{KL}, \eta_{TV} = 1 - (4\delta(1 - \delta))^{\frac{n}{2} + O(\log n)}. \tag{41}$$
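
The sandwich between the repetition-code lower bound and the upper bound (39) can be checked numerically for small $n$. The sketch below computes $I(U;Y)$ exactly (in bits, so that $I(U;X) = 1$) using the fact that the Hamming weight of $Y$ is a sufficient statistic for $U$, and compares it with $1 - (4\delta(1-\delta))^n$. The function names and the choice $\delta = 0.1$ are illustrative assumptions.

```python
import math

def repetition_mi_bits(n, delta):
    """I(U;Y) in bits for U ~ Bern(1/2), X = (U,...,U) sent over BSC(delta)^n.

    Given U = 0 the weight of Y is Binom(n, delta); given U = 1 it is
    Binom(n, 1 - delta), so I(U;Y) equals I(U; weight of Y).
    """
    mi = 0.0
    for k in range(n + 1):
        p0 = math.comb(n, k) * delta ** k * (1 - delta) ** (n - k)
        p1 = math.comb(n, k) * (1 - delta) ** k * delta ** (n - k)
        mix = 0.5 * (p0 + p1)
        for pk in (p0, p1):
            if pk > 0:
                mi += 0.5 * pk * math.log2(pk / mix)
    return mi

def upper_bound_39(n, delta):
    """Right-hand side of (39)."""
    return 1 - (4 * delta * (1 - delta)) ** n

delta = 0.1   # hypothetical crossover probability
for n in range(1, 9):
    print(n, repetition_mi_bits(n, delta), upper_bound_39(n, delta))
```

Since $I(U;X) = 1$ bit, each printed mutual information is a lower bound on $\eta_{KL}$ of the n-letter channel, while the last column is the upper bound from Corollary 6.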

4 Dobrushin’s coefficients in networks 

The proof of Theorem 5 relies on the characterization (17) of $\eta_{KL}$ via mutual information, which
satisfies the chain rule. Neither of these two properties is enjoyed by the total variation. Nevertheless,
the following is an exact counterpart of Theorem 5 for total variation.

Theorem 8. Under the same assumption of Theorem 5,

$$\eta_{TV}(P_{V,W|X}) \le (1 - \eta_W) \eta_{TV}(P_{V|X}) + \eta_W \eta_{TV}(P_{\mathrm{pa}(W),V|X}), \tag{42}$$

where $\eta_W = \eta_{TV}(P_{W|\mathrm{pa}(W)})$. Furthermore, let $\mathrm{perc}(V)$ denote the probability that there is a path
from $X$ to $V$ in the graph if each node $v$ is removed independently with probability $1 - \eta_v$ (site
percolation). Then, we have for every $V \subset \mathcal{V}$

$$\eta_{TV}(P_{V|X}) \le \mathrm{perc}(V). \tag{43}$$

In particular, if $\eta_v < 1$ for all $v \in \mathcal{V}$, then $\eta_{TV}(P_{V|X}) < 1$.
Proof. Fix $x, x'$ and denote by $P$ (resp. $Q$) the distribution conditioned on $X = x$ (resp. $x'$). Denote
$Z = \mathrm{pa}(W)$. The goal is to show

$$d_{TV}(P_{VW}, Q_{VW}) \le (1 - \eta_W) d_{TV}(P_V, Q_V) + \eta_W d_{TV}(P_{ZV}, Q_{ZV}), \tag{44}$$

which, by the arbitrariness of $x, x'$ and in view of the characterization of $\eta_{TV}$ in (7), yields the desired
(42). By Lemma 22 in Appendix C, there exists a coupling of $P_{ZV}$ and $Q_{ZV}$, denoted by $\pi_{ZVZ'V'}$,
such that

$$\pi[(Z, V) \ne (Z', V')] = d_{TV}(P_{ZV}, Q_{ZV}),$$

$$\pi[V \ne V'] = d_{TV}(P_V, Q_V)$$

simultaneously (that is, this coupling is jointly optimal for the total variation of the joint
distributions and one pair of marginals).

Conditioned on $Z = z$ and $Z' = z'$ and independently of $VV'$, let $WW'$ be distributed according
to a maximal coupling of the conditional laws $P_{W|Z=z}$ and $P_{W|Z=z'}$ (recall that $Q_{W|Z} = P_{W|Z} =
P_{W|\mathrm{pa}(W)}$ by definition). This defines a joint distribution $\pi_{ZVWZ'V'W'}$, under which we have the
Markov chain $VV' \to ZZ' \to WW'$. Then

$$\pi[W \ne W' | ZVZ'V'] = \pi[W \ne W' | ZZ'] = d_{TV}(P_{W|\mathrm{pa}(W)=Z}, P_{W|\mathrm{pa}(W)=Z'}) \le \eta_W 1\{Z \ne Z'\}.$$

Therefore we have

$$\begin{aligned}
\pi[W \ne W' | V = V'] &= \mathbb{E}[\pi[W \ne W' | ZZ'] \mid V = V'] \\
&\le \eta_W \pi[Z \ne Z' | V = V'].
\end{aligned}$$

Multiplying both sides by $\pi[V = V']$ and then adding $\pi[V \ne V']$, we obtain

$$\begin{aligned}
\pi[(W, V) \ne (W', V')] &\le (1 - \eta_W) \pi[V \ne V'] + \eta_W \pi[(Z, V) \ne (Z', V')] \\
&= (1 - \eta_W) d_{TV}(P_V, Q_V) + \eta_W d_{TV}(P_{ZV}, Q_{ZV}),
\end{aligned}$$

where the LHS is lower bounded by $d_{TV}(P_{WV}, Q_{WV})$ and the equality is due to the choice of $\pi$.
This yields the desired (44), completing the proof of (42). The rest of the proof is done as in
Theorem 5. □

As a consequence of Theorem 8, both Corollary 6 and 7 extend to total variation verbatim with
$\eta_{KL}$ replaced by $\eta_{TV}$:

Corollary 9. In the setting of Corollary 6 we have

$$\eta_{TV}(P_{Y^n|W}) \le 1 - (1 - \eta_{TV}(P_{Y|X}))^n < n \cdot \eta_{TV}(P_{Y|X}). \tag{45}$$

Corollary 10. In the setting of Corollary 7 we have

$$\eta_{TV}(P_{V|X}) \le \sum_{\pi: X \xrightarrow{sf} V} \eta_{TV}^\pi \le \sum_{\pi: X \to V} \eta_{TV}^\pi,$$

where for any path $\pi = (X, v_1, \ldots, v_m)$ we denoted $\eta_{TV}^\pi \triangleq \prod_{j=1}^m \eta_{TV}(P_{v_j|\mathrm{pa}(v_j)})$.
Evaluation for the BSC Consider the n-letter BSC defined in (4), where $Y = X + Z$ with
$Z \sim \mathrm{Bern}(\delta)^n$ and $|Z| \sim \mathrm{Binom}(n, \delta)$. By Dobrushin’s characterization (7), we have

\begin{align}
\eta_{TV} &= \max_{x, x' \in \mathbb{F}_2^n} d_{TV}(P_{Y|X=x}, P_{Y|X=x'}) \notag \\
&= d_{TV}(\mathrm{Bern}(\delta)^n, \mathrm{Bern}(1 - \delta)^n) \notag \\
&= d_{TV}(\mathrm{Binom}(n, \delta), \mathrm{Binom}(n, 1 - \delta)) \tag{46} \\
&= 1 - 2\mathbb{P}[|Z| > n/2] - \mathbb{P}[|Z| = n/2] \tag{47} \\
&= 1 - (4\delta(1 - \delta))^{\frac{n}{2} + O(\log n)}, \tag{48}
\end{align}

where (46) follows from the sufficiency of $|Z|$ for testing the two distributions, (47) follows from
$d_{TV}(P, Q) = 1 - \int P \wedge Q$ and (48) follows from standard binomial tail estimates (see, e.g., [Ash65,
Lemma 4.7.2]). The above sharp estimate should be compared to the bound obtained by applying
Corollary 9:

$$\eta_{TV} \le 1 - (2\delta)^n. \tag{49}$$

Although (49) correctly predicts the exponential convergence of $\eta_{TV} \to 1$ whenever $\delta < \frac{1}{2}$, the
exponent estimated is not optimal.
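
The comparison between the exact value and the bound (49) is easy to reproduce for small $n$. The sketch below evaluates (46) directly as a total variation between two binomial laws, cross-checks it against the tail expression (47), and compares both with the bound of Corollary 9. The function names and $\delta = 0.1$ are illustrative assumptions.

```python
import math

def binom_pmf(n, k, p):
    """Probability that Binom(n, p) equals k."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def eta_tv_bsc(n, delta):
    """Exact eta_TV of BSC(delta)^n via (46)."""
    return 0.5 * sum(abs(binom_pmf(n, k, delta) - binom_pmf(n, k, 1 - delta))
                     for k in range(n + 1))

def eta_tv_via_47(n, delta):
    """Right-hand side of (47); assumes delta <= 1/2."""
    tail = sum(binom_pmf(n, k, delta) for k in range(n + 1) if 2 * k > n)
    tie = binom_pmf(n, n // 2, delta) if n % 2 == 0 else 0.0
    return 1 - 2 * tail - tie

def corollary9_bound(n, delta):
    """1 - (1 - eta)^n with eta = |1 - 2*delta|; equals (49) when delta < 1/2."""
    return 1 - (1 - abs(1 - 2 * delta)) ** n

delta = 0.1   # hypothetical crossover probability
for n in range(1, 12):
    print(n, eta_tv_bsc(n, delta), eta_tv_via_47(n, delta), corollary9_bound(n, delta))
```

For $n = 1$ the bound is tight; for larger $n$ the gap reflects the suboptimal exponent noted above.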

5 Bounding $F_I$-curves in networks

In this section our goal is to produce upper bounds on the $F_I$-curve of a Bayesian network
$F_{V|X}$ in terms of those of the constituent channels. For any vertex $v$ of the network, denote the
$F_I$-curve of the channel $P_{v|\mathrm{pa}(v)}$ by $F_{v|\mathrm{pa}(v)}$, abbreviated by $F_v$, and the concavified version by $F_v^c$.

Theorem 11. In the setting of Theorem 5,

$$F_{V,W|X} \le F_{V|X} + F_W^c \circ (F_{\mathrm{pa}(W),V|X} - F_{V|X}), \tag{50}$$

$$F_{V,W|X}^c \le F_{V|X}^c + F_W^c \circ (F_{\mathrm{pa}(W),V|X}^c - F_{V|X}^c). \tag{51}$$

Furthermore, the right-hand side of (51) is non-negative, concave, nondecreasing and upper bounded
by the identity mapping id.

Remark 1. The $F_I$-curve estimate in Theorem 11 implies that of contraction coefficients of
Theorem 5. To see this, note that since $F_{\mathrm{pa}(W),V|X} \le \mathrm{id}$, the following is a relaxation of (50):

$$\mathrm{id} - F_{V,W|X} \ge (\mathrm{id} - F_W^c) \circ (\mathrm{id} - F_{V|X}). \tag{52}$$

Consequently, if each channel in the network satisfies an SDPI, then the end-to-end SDPI is also
satisfied. That is, if each vertex has a non-trivial $F_I$-curve, i.e., $F_v < \mathrm{id}$ for all $v \in \mathcal{V}$, then the
channel $X \to V$ also has a strict contractive property, i.e., $F_{V|X} < \mathrm{id}$.

Furthermore, since $F_W^c(t) \le \eta_W t$, noting the fact that $F_{V|X}'(0) = \eta_{KL}(P_{V|X})$ and taking the
derivative on both sides of (50) we see that the latter implies (22).

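Proof. We first show that for any channel $P_{Y|X}$, its $F_{Y|X}$-curve satisfies that $t \mapsto t - F_{Y|X}(t)$ is nondecreasing. Indeed, it is known, cf. [CPW15, Section I], that $t \mapsto \frac{F_{Y|X}(t)}{t}$ is nonincreasing. Thus, for $t_1 < t_2$ we have

$$t_2 - F_{Y|X}(t_2) \ge t_2 - \frac{t_2}{t_1} F_{Y|X}(t_1) = \frac{t_2}{t_1}\left(t_1 - F_{Y|X}(t_1)\right) \ge t_1 - F_{Y|X}(t_1),$$

where the last step follows from the fact that $F_{Y|X}(t) \le t$. Similarly, for any concave function $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ s.t. $\Phi(0) = 0$ we have $\Phi(t_2) \le \frac{t_2}{t_1}\Phi(t_1)$. Therefore, the argument above implies that $t \mapsto t - \Phi(t)$ is nondecreasing whenever $\Phi \le \mathrm{id}$ and, in particular, so is $t \mapsto t - F_W^c(t)$.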

Let $P_{UX}$ be such that $I(U;X) \le t$ and $I(U;W,V) = F_{V,W|X}(t)$. By the same argument that leads to (27) we obtain

$$I(U;W|V = v_0) \le F_W(I(U;\mathrm{pa}(W)|V = v_0)) \le F_W^c(I(U;\mathrm{pa}(W)|V = v_0)).$$

Averaging over $v_0 \sim P_V$ and applying Jensen's inequality we get

$$I(U;W,V) \le F_W^c(I(U;\mathrm{pa}(W),V) - I(U;V)) + I(U;V).$$

Therefore,

$$F_{V,W|X}(t) \le F_W^c(I(U;\mathrm{pa}(W),V) - I(U;V)) + I(U;V)$$

$$\le F_W^c(F_{\mathrm{pa}(W),V|X}(t) - I(U;V)) + I(U;V) \tag{53}$$

$$= F_{\mathrm{pa}(W),V|X}(t) - (\mathrm{id} - F_W^c)(F_{\mathrm{pa}(W),V|X}(t) - I(U;V))$$

$$\le F_{\mathrm{pa}(W),V|X}(t) - (\mathrm{id} - F_W^c)(F_{\mathrm{pa}(W),V|X}(t) - F_{V|X}(t)) \tag{54}$$

$$= F_{V|X}(t) + F_W^c(F_{\mathrm{pa}(W),V|X}(t) - F_{V|X}(t))$$

$$\le F^c_{V|X}(t) + F_W^c(F^c_{\mathrm{pa}(W),V|X}(t) - F^c_{V|X}(t)), \tag{55}$$

where (53) and (54) follow from the facts that $t \mapsto F_W^c(t)$ and $t \mapsto t - F_W^c(t)$ are both nondecreasing, and (55) follows from the fact that $a + F_W^c(b - a)$ is nondecreasing in both $a$ and $b$.

Finally, we need to show that the right-hand side of (55) is nondecreasing and concave (this automatically implies that (55) is an upper bound on the concavification $F^c_{V,W|X}$). To that end, denote $t_\lambda = \lambda t_1 + (1-\lambda)t_0$, $f_\lambda = F^c_{V|X}(t_\lambda)$, $g_\lambda = F^c_{\mathrm{pa}(W),V|X}(t_\lambda)$ and notice the chain

$$f_\lambda + F_W^c(g_\lambda - f_\lambda) \ge \lambda f_1 + (1-\lambda) f_0 + F_W^c(\lambda(g_1 - f_1) + (1-\lambda)(g_0 - f_0)) \tag{56}$$

$$\ge \lambda(f_1 + F_W^c(g_1 - f_1)) + (1-\lambda)(f_0 + F_W^c(g_0 - f_0)), \tag{57}$$

where (56) is from concavity of $F^c_{V|X}$, $F^c_{\mathrm{pa}(W),V|X}$ and monotonicity of $(a,b) \mapsto a + F_W^c(b - a)$, and (57) is from concavity of $F_W^c$. □

Corollary 12. In the setting of Corollary 6 we have

$$F_{Y^n|W}(t) \le t - \psi_n(t),$$

where $\psi_1 = \psi$, $\psi_n = \psi_{n-1} \circ \psi$ and $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ is a convex function such that

$$F_{Y|X}(t) \le t - \psi(t).$$

Proof. The case of $n = 1$ follows from the assumption on $\psi$. The case of $n > 1$ is proved by induction, with the induction step being an application of Theorem 11 with $V = Y^{n-1}$ and $W = Y_n$. □

Generally, the bound of Corollary 12 cannot be improved in the vicinity of zero. As an example where this is tight, consider a parallel erasure channel, whose $F_I$-curve for $t \le \log q$ is computed in Theorem 17 below.
Evaluation for the BSC. To ease the notation, all logarithms are with respect to base two in this section. Let $h(y) = y\log\frac{1}{y} + (1-y)\log\frac{1}{1-y}$ denote the binary entropy function and $h^{-1} : [0,1] \to [0,\frac{1}{2}]$ its functional inverse. Let $p * q = p(1-q) + q(1-p)$ for $p,q \in [0,1]$ denote binary convolution and define

$$\psi(t) = t - 1 + h(\delta * h^{-1}(\max(1 - t, 0))) \tag{58}$$

which is convex and increasing in $t$ on $\mathbb{R}_+$. For $n = 1$ it was shown in [CPW15, Section 2] that the $F_I$-curve of $\mathrm{BSC}(\delta)$ is given by

$$F_I(t, \mathrm{BSC}(\delta)) = F_I^c(t, \mathrm{BSC}(\delta)) = t - \psi(t).$$
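To make the definition (58) concrete, the following sketch evaluates $\psi$ and its $n$-fold composition numerically. It is an illustration under stated assumptions, not part of the original argument: $h^{-1}$ is computed by bisection, logarithms are base two as in this section, and the function names are hypothetical.

```python
from math import log2

def h(y):
    """Binary entropy in bits, with h(0) = h(1) = 0."""
    if y <= 0.0 or y >= 1.0:
        return 0.0
    return -y * log2(y) - (1 - y) * log2(1 - y)

def h_inv(v, tol=1e-12):
    """Inverse of h on [0, 1/2], by bisection (h is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < v:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def conv(p, q):
    """Binary convolution p * q = p(1-q) + q(1-p)."""
    return p * (1 - q) + q * (1 - p)

def psi(t, delta):
    """psi(t) from (58)."""
    return t - 1 + h(conv(delta, h_inv(max(1 - t, 0.0))))

def psi_n(t, delta, n):
    """n-fold composition of psi, applied to t."""
    for _ in range(n):
        t = psi(t, delta)
    return t

if __name__ == "__main__":
    delta = 0.1
    print(psi(1e-4, delta) / 1e-4)       # slope near zero
    print(psi_n(45.0, delta, 50) / 50)   # normalized value for theta = 0.9
```

Near zero the ratio $\psi(t)/t$ should be close to $4\delta(1-\delta)$, and for $t \ge 1$ each application of $\psi$ subtracts the capacity $1 - h(\delta)$.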


Applying Corollary 12 we obtain the following bound on the $F_I$-curve of the BSC of blocklength $n$ (even with feedback):

Proposition 13. Let $Z_1,\ldots,Z_n \overset{\text{i.i.d.}}{\sim} \mathrm{Bern}(\delta)$ be independent of $U$. For any (encoder) functions $f_j$, $j = 1,\ldots,n$, define

$$X_j = f_j(U, Y^{j-1}), \qquad Y_j = X_j + Z_j.$$

Then

$$I(U;Y^n) \le I(U;X^n) - \psi_n(I(U;X^n)), \tag{59}$$

where $\psi_1 = \psi$, $\psi_n = \psi_{n-1} \circ \psi$ and $\psi$ is defined in (58).

Remark 2. The estimate (59) was first shown by A. Samorodnitsky (private communication) under extra technical constraints on the joint distribution of $(X^n, W)$ and in the absence of feedback. We have then observed that an Evans-Schulman type of technique yields (59) generally.

Since $\psi(t) = 4\delta(1-\delta)t + o(t)$ as $t \to 0$ we get

$$F_I^c(t, \mathrm{BSC}(\delta)^n) \le t - t\,(4\delta(1-\delta))^{n + o(n)}$$

as $n \to \infty$ for any fixed $t$. A simple lower bound, for comparison purposes, can be inferred from (40) after noticing that there we have $I(U;X) = 1$, and so

$$F_I^c(1, \mathrm{BSC}(\delta)^n) \ge 1 - (4\delta(1-\delta))^{\frac{n}{2} + O(\log n)}.$$

This shows that the bound of Proposition 13 is order-optimal: $F(t) \to t$ exponentially fast. The exact exponent is given by (41).

As another point of comparison, we note the following. Existence of capacity-achieving error-correcting codes easily implies

$$\lim_{n\to\infty} \frac{1}{n} F_I^c(n\theta, \mathrm{BSC}(\delta)^n) = \min(\theta, C),$$

where $C = 1 - h(\delta)$ is the Shannon capacity of $\mathrm{BSC}(\delta)$. Since for $t \ge 1$ we have $\psi(t) = t - C$, one can show that

$$\lim_{n\to\infty} \frac{1}{n} \psi_n(n\theta) = |\theta - C|^+,$$

and therefore we conclude that in this sense the bound (59) is asymptotically tight.
6 SDPI via comparison to erasure channels

So far our leading example has been the binary symmetric channel (4). We now consider another important example:

Example 1. For any set $\mathcal{X}$, the erasure channel on $\mathcal{X}$ with erasure probability $\delta$ is a random transformation from $\mathcal{X}$ to $\mathcal{X} \cup \{?\}$, where $? \notin \mathcal{X}$, defined as

$$P_{E|X}(e|x) = \begin{cases} \delta, & e = ? \\ 1 - \delta, & e = x. \end{cases}$$

For $\mathcal{X} = [q]$, we call it the $q$-ary erasure channel, denoted by $\mathrm{EC}_q(\delta)$. In the binary case, we denote the binary erasure channel by $\mathrm{BEC}(\delta) = \mathrm{EC}_2(\delta)$. A simple calculation shows that for every $P_{UX}$ we have

$$I(U;E) = (1-\delta)\,I(U;X) \tag{60}$$

and therefore for $\mathrm{EC}_q(\delta)$ we have $\eta_{\rm KL}(P_{E|X}) = 1 - \delta$ and $F_I(t) = (1-\delta)\min(t, \log q)$.
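The identity (60) is easy to confirm on a randomly drawn joint distribution. The sketch below is only an illustration: the alphabet sizes, the random seed and all function names are assumptions made for the example.

```python
from math import log2
import random

def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def through_erasure(joint_ux, delta):
    """Joint pmf of (U, E) when X is erased with probability delta."""
    joint_ue = {}
    for (u, x), p in joint_ux.items():
        joint_ue[(u, x)] = joint_ue.get((u, x), 0.0) + (1 - delta) * p
        joint_ue[(u, "?")] = joint_ue.get((u, "?"), 0.0) + delta * p
    return joint_ue

def random_joint(num_u, q, seed=0):
    """A strictly positive random joint pmf of (U, X) on [num_u] x [q]."""
    rng = random.Random(seed)
    w = {(u, x): rng.random() + 1e-3 for u in range(num_u) for x in range(q)}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}

if __name__ == "__main__":
    joint = random_joint(3, 4)
    delta = 0.3
    print(mutual_information(through_erasure(joint, delta)))
    print((1 - delta) * mutual_information(joint))
```

The two printed values should agree up to floating-point error, since the erasure pattern is independent of $(U, X)$.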

Next we recall a standard information-theoretic ordering on channels, cf. [EGK11, Section 5.6]:

Definition 2. Given two channels with common input alphabet, $P_{Y|X}$ and $P_{Y'|X}$, we say that $P_{Y'|X}$ is less noisy than $P_{Y|X}$, denoted by $P_{Y|X} \le_{\rm l.n.} P_{Y'|X}$, if for all joint distributions $P_{UX}$ we have

$$I(U;Y) \le I(U;Y'). \tag{61}$$

We also have an equivalent formulation in terms of divergence:

Proposition 14. $P_{Y|X} \le_{\rm l.n.} P_{Y'|X}$ if and only if for all $P_X, Q_X$ we have

$$D(Q_Y\|P_Y) \le D(Q_{Y'}\|P_{Y'}), \tag{62}$$

where $P_Y, P_{Y'}, Q_Y, Q_{Y'}$ are the output distributions induced by $P_X, Q_X$ over $P_{Y|X}$ and $P_{Y'|X}$, respectively.

See Appendix A.4 for the proof.$^{8}$

The following result shows that the contraction coefficient of KL divergence can be equivalently formulated as being less noisy than the corresponding erasure channel:$^{9}$

Proposition 15. For an arbitrary channel $P_{Y|X}$ we have

$$\eta_{\rm KL}(P_{Y|X}) \le \eta \iff P_{Y|X} \le_{\rm l.n.} P_{E|X}, \tag{63}$$

where $P_{E|X}$ is the erasure channel on the same input alphabet and erasure probability $1 - \eta$.

Proof. By the definition of $\eta_{\rm KL}(P_{Y|X})$, the left-hand side of (63) holds if and only if for every $P_{UX}$

$$I(U;Y) \le (1-\delta)\,I(U;X), \qquad \delta = 1 - \eta, \tag{64}$$

where the right-hand side is precisely $I(U;E)$ by (60). □

$^{8}$ It is tempting to put forward a fixed-$P_X$ version of the previous criterion (similar to Theorem 4). That would, however, require some extra assumptions on $P_X$. Indeed, knowing that $I(W;Y) \le I(W;Y')$ for all $P_{W,X}$ with a given fixed $P_X$ tells us nothing about how the distributions $P_{Y|X=x}$ and $P_{Y'|X=x}$ compare outside the support of $P_X$. (For discrete channels and strictly positive $P_X$, however, it is easy to argue that indeed (62) holds for all $Q_X$ if and only if (61) holds for all $P_{U,X}$ with a given marginal $P_X$.)

$^{9}$ Note that another popular partial order for random transformations, that of stochastic degradation, may also be related to contraction coefficients, see [Rag14, Remark 3.2].
It turns out that the notion of less-noisiness tensorizes:

Proposition 16. If $P_{Y_1|X_1} \le_{\rm l.n.} P_{Y_1'|X_1}$ and $P_{Y_2|X_2} \le_{\rm l.n.} P_{Y_2'|X_2}$ then

$$P_{Y_1|X_1} \times P_{Y_2|X_2} \le_{\rm l.n.} P_{Y_1'|X_1} \times P_{Y_2'|X_2}.$$

In particular,

$$\eta_{\rm KL}(P_{Y|X}) \le \eta \implies P^n_{Y|X} \le_{\rm l.n.} P^n_{E|X}, \tag{65}$$

where $P_{E|X}$ is the erasure channel on the same input alphabet and erasure probability $1 - \eta$.

Proof. Construct a relevant joint distribution $U \to X^2 \to (Y^2, Y'^2)$ and consider

$$I(U;Y_1,Y_2) = I(U;Y_1) + I(U;Y_2|Y_1). \tag{66}$$

Now since $U \to X_2 \to Y_2$ conditionally on $Y_1$, we have by $P_{Y_2|X_2} \le_{\rm l.n.} P_{Y_2'|X_2}$

$$I(U;Y_2|Y_1) \le I(U;Y_2'|Y_1),$$

and putting this back into (66) we get

$$I(U;Y_1,Y_2) \le I(U;Y_1) + I(U;Y_2'|Y_1) = I(U;Y_1,Y_2').$$

Repeating the same argument, but conditioning on $Y_2'$, we get

$$I(U;Y_1,Y_2) \le I(U;Y_1',Y_2'),$$

as required. The last claim of the proposition follows from Proposition 15. □

Consequently, everything that has been said in this paper about $\eta_{\rm KL}(P_{Y|X})$ can be restated in terms of seeking to compare a given channel, in the sense of the $\le_{\rm l.n.}$ order, to an erasure channel. It seems natural, then, to consider erasure channels in somewhat greater detail.

6.1 $F_I$-curve of erasure channels

Theorem 17. Consider the $q$-ary erasure channel of blocklength $n$ and erasure probability $\delta$. Its $F_I$-curve is bounded by

$$F_I^c(t, \mathrm{EC}_q(\delta)^n) \le \mathbb{E}[\min(B \log q, t)], \qquad B \sim \mathrm{Binom}(n, 1-\delta). \tag{67}$$

The bound is tight in the following cases:

1. at $t = k \log q$ with integral $k \le n$ if and only if an $(n,k,n-k+1)_q$ MDS code exists;$^{10}$

2. for $t \le \log q$ and $t \ge (n-1)\log q$;

3. for all $t$ when $n = 1, 2, 3$.

Remark 3. Introducing $B' \sim \mathrm{Binom}(n-1, 1-\delta)$ and using the identity $\mathbb{E}[B\,1\{B \le a\}] = n(1-\delta)\,\mathbb{P}[B' \le a-1]$, we can express the right-hand side of (67) in terms of binomial CDFs:

$$\mathbb{E}[\min(B,x)] = x + \mathbb{P}[B' \le \lfloor x \rfloor - 1]\,(1-\delta)(n-x) - x\delta\,\mathbb{P}[B' \le \lfloor x \rfloor].$$

This implies that the upper bound (67) is piecewise-linear, increasing and concave.
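The closed form in Remark 3 can be compared against a direct evaluation of $\mathbb{E}[\min(B,x)]$. The following sketch is illustrative; the function names and the test values of $n$, $\delta$ and $x$ are assumptions made for the example.

```python
from math import comb, floor

def binom_pmf(n, k, p):
    """P[Binom(n, p) = k]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(n, k, p):
    """P[Binom(n, p) <= k], equal to 0 for k < 0 and 1 for k >= n."""
    if k < 0:
        return 0.0
    return sum(binom_pmf(n, j, p) for j in range(min(k, n) + 1))

def expected_min_direct(n, delta, x):
    """E[min(B, x)] for B ~ Binom(n, 1 - delta), summed term by term."""
    return sum(binom_pmf(n, k, 1 - delta) * min(k, x) for k in range(n + 1))

def expected_min_cdf(n, delta, x):
    """The same quantity via the binomial-CDF expression of Remark 3."""
    k = floor(x)
    return (x
            + binom_cdf(n - 1, k - 1, 1 - delta) * (1 - delta) * (n - x)
            - x * delta * binom_cdf(n - 1, k, 1 - delta))

if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.5, 5.0):
        print(x, expected_min_direct(5, 0.3, x), expected_min_cdf(5, 0.3, x))
```

For the bound (67) itself one would evaluate $\log q \cdot \mathbb{E}[\min(B, t/\log q)]$.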

$^{10}$ We remind the reader that a subset $C$ of $[q]^n$ is called an $(n,k,d)_q$ code if $|C| = q^k$ and the Hamming distance between any two points of $C$ is at least $d$. A code is called maximum-distance separable (MDS) if $d = n - k + 1$. This is equivalent to the property that the projection of $C$ onto any subset of $k$ coordinates is bijective.
Proof. Consider arbitrary $U \to X^n \to E^n$ with $P_{E^n|X^n} = \mathrm{EC}_q(\delta)^n$. Let $S$ be a random subset of $[n]$ which includes each $i \in [n]$ independently with probability $1 - \delta$. A direct computation shows that

$$I(U;E^n) = I(U;X_S,S) = \sum_{\sigma \subset [n]} \mathbb{P}[S = \sigma]\, I(U;X_\sigma) \tag{68}$$

$$\le \sum_{\sigma \subset [n]} \mathbb{P}[S = \sigma] \min(|\sigma| \log q, t) = \mathbb{E}[\min(B \log q, t)]. \tag{69}$$

From here (67) follows by taking the supremum over $P_{U,X^n}$.

Claims about tightness follow by constructing $U = X^n$ and taking $X^n$ to be the output of the MDS code (so that $H(X_\sigma) = \min(|\sigma| \log q, t)$) and invoking the concavity of $F_I^c(t)$. One also notes that the $[n,1,n]_q$ (repetition) code and the $[n,n-1,2]_q$ (single parity check) code show tightness at $t = \log q$ and $t = (n-1)\log q$.

Finally, we prove that when $t = k \log q$ and the bound (67) is tight, then a (possibly non-linear) $(n,k,n-k+1)_q$ MDS code must exist. First, notice that the right-hand side of (67) is a piecewise-linear and concave function. Thus the bound being tight for $F_I^c(t)$ (that is, the concave envelope of $F_I(t)$) should also be tight as a bound for $F_I(t)$. Consequently, there must exist $U \to X^n \to E^n$ such that the bound (69) is tight with $t = I(U;X^n)$. This implies that we should have

$$I(U;X_\sigma) = \min(|\sigma| \log q, t) \tag{70}$$

for all $\sigma \subset [n]$. In particular, we have $I(U;X_i) = \log q$ and thus $H(X_i|U) = 0$, and without loss of generality we may assume that $U = X^n$. Again from (70) we have that $H(X^n) = H(X_\sigma) = k \log q$ whenever $|\sigma| \ge k$. This implies that $X^n$ is uniformly distributed on a set of size $q^k$ and the projection onto any $k$ coordinates is injective. This is exactly the characterization of an MDS code (possibly non-linear) with parameters $(n,k,n-k+1)_q$. □
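The tightness claim can be illustrated on a small binary example. The sketch below takes $U = X^n$ uniform on a codebook, evaluates the right-hand side of (68) as $\sum_\sigma \mathbb{P}[S=\sigma] H(X_\sigma)$, and compares it with the bound (69). The codebooks and function names are chosen for illustration only and logarithms are base two.

```python
from itertools import combinations, product
from math import comb, log2

def projection_entropy(codebook, coords):
    """H(X_sigma) in bits for X uniform on `codebook`, restricted to `coords`."""
    counts = {}
    for word in codebook:
        key = tuple(word[i] for i in coords)
        counts[key] = counts.get(key, 0) + 1
    total = len(codebook)
    return -sum(c / total * log2(c / total) for c in counts.values())

def erasure_information(codebook, delta):
    """Sum over subsets sigma of P[S = sigma] H(X_sigma), as in (68) with U = X^n."""
    n = len(codebook[0])
    total = 0.0
    for size in range(n + 1):
        weight = (1 - delta) ** size * delta ** (n - size)
        for coords in combinations(range(n), size):
            total += weight * projection_entropy(codebook, coords)
    return total

def upper_bound(n, delta, t, q=2):
    """E[min(B log q, t)] with B ~ Binom(n, 1 - delta), as in (69)."""
    return sum(comb(n, k) * (1 - delta) ** k * delta ** (n - k)
               * min(k * log2(q), t) for k in range(n + 1))

# Single parity check code of length 3 (an MDS code with k = 2).
parity_code = [w for w in product((0, 1), repeat=3) if sum(w) % 2 == 0]
# A two-word code that is not MDS: its first coordinate is constant.
lopsided_code = [(0, 0, 0), (0, 1, 1)]

if __name__ == "__main__":
    print(erasure_information(parity_code, 0.3), upper_bound(3, 0.3, 2.0))
    print(erasure_information(lopsided_code, 0.3), upper_bound(3, 0.3, 1.0))
```

The first pair of values should coincide, while the second codebook falls strictly below the bound because one of its coordinates carries no information.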

We also formulate some interesting observations for binary erasure channels:

Proposition 18. For $\mathrm{BEC}(\delta)^n$ we have:

1. For $n \ge 3$ we have that $F_I(t)$ is not concave. More exactly, $F_I(t) < F_I^c(t)$ for $t \in (1,2)$.

2. For arbitrary $n$ and $t \le \log 2$ or $t \ge (n-1)\log 2$ we have $F_I(t) = F_I^c(t) = \mathbb{E}[\min(B \log 2, t)]$ with $B$ defined in (67).

3. For $t = 2$, $n = 4$ the bound (67) is not tight and $F_I^c(t) < \mathbb{E}[\min(B \log 2, t)]$.

Proof. First note that in Definition 1 of $F_I(t)$ the supremum is a maximum and $U$ can be restricted to an alphabet of size $|\mathcal{X}| + 2$. So in particular, $F_I(t) = f$ if and only if there exists $(U, X^n)$ with $I(U;Y^n) = f$ and $I(U;X^n) \le t$.

Now consider $t \in (1,2)$ and $n = 3$ and suppose $(U,X^n)$ achieves the bound. For the bound to be tight we must have $I(U;X^n) = t$. For the bound to be tight we must also have $I(U;X_i) = 1$ for all $i$, that is $H(X_i) = 1$, $H(X_i|U) = 0$ and $H(X^n|U) = 0$. Consequently, without loss of generality we may take $U = X^n$. So for the bound to be tight we need to find a distribution s.t.

$$H(X^3) = H(X_1,X_2) = H(X_2,X_3) = H(X_1,X_3) = t, \quad H(X_1) = H(X_2) = H(X_3) = 1. \tag{71}$$

It is straightforward to verify that this set of entropies satisfies Shannon inequalities (i.e. submodularity of entropy checks), so the main result of [ZY97] shows that there does exist a sequence of triples $X^3$ (over large alphabets) which attains this point. We will show, however, that this is impossible for binary-valued random variables. First, notice that the set of entropy vectors achievable by binary triplets is a closed subset of $\mathbb{R}_+^7$ (as a continuous image of a compact set). Thus, it is sufficient to show that (71) itself is not achievable.

Second, note that for any pair $A, B$ of binary random variables with uniform marginals we must have

$$A = B + Z, \qquad B \perp Z \sim \mathrm{Bern}(p).$$

Without loss of generality, assume that $X_2 = X_1 + Z$ where $H(Z) = t - 1 > 0$. Moreover, $H(X_3|X_1,X_2) = 0$ implies that $X_3 = f(X_1,X_2)$ for some function $f$.

Given $X_1$ we have $H(X_3|X_1 = x) = H(X_3|X_2 = x) = t - 1 > 0$. So the function $x_1 \mapsto f(x_1,x)$ should not be constant for either choice of $x \in \{0,1\}$, and the same holds for $x_2 \mapsto f(x,x_2)$. Eliminating cases leaves us with $f = x_1 + x_2$ or $f = x_1 + x_2 + 1$. But then $X_3 = X_1 + X_2 = Z$ (up to complement) and $H(X_3) < 1$, which is a contradiction.

Since by Theorem 17 we know that the bound (67) is tight for $F_I^c(t)$, we conclude that

$$F_I(t) < F_I^c(t), \qquad \forall t \in (1,2).$$

To show the second claim consider $U = X^n$ and $X_1 = \cdots = X_n \sim \mathrm{Bern}(p)$ for $t \le \log 2$. For $t \ge (n-1)\log 2$ take $X^{n-1}$ to be i.i.d. $\mathrm{Bern}(\frac{1}{2})$ and

$$X_n = X_1 + \cdots + X_{n-1} + Z,$$

where $Z \sim \mathrm{Bern}(p)$. This yields $I(U;X_\sigma) = H(X_\sigma) = |\sigma| \log 2$ for every subset $\sigma \subset [n]$ of size up to $n - 1$. Consequently, the bound (67) must be tight.

Finally, the third claim follows from Theorem 17 and the fact that there is no $[4,2,3]$ binary code, e.g. [MS77, Corollary 7, Chapter 11]. □

Putting together (65) and (67) we get the following upper bound on the concavified $F_I$-curve of $n$-letter product channels in terms of the contraction coefficient of the single-letter channel.

Corollary 19. If $\eta_{\rm KL}(P_{Y|X}) = \eta$, then

$$F_I^c(t, P^n_{Y|X}) \le \mathbb{E}[\min(B \log q, t)], \qquad B \sim \mathrm{Binom}(n, 1-\delta),$$

where $\delta = 1 - \eta$ and $q$ is the size of the input alphabet.

This gives an alternative proof of Corollary 6 for the case of no feedback.

6.2 Samorodnitsky’s SDPI 

So far, we have been concerned with bounding the “output” mutual information in terms of a 
certain “input” one. However, frequently, one is interested in bounding some “output” information 
given knowledge of several input ones. For example, for the parallel channel we have shown that 

$$I(W;Y^n) \le \left(1 - (1 - \eta_{\rm KL}(P_{Y|X}))^n\right) I(W;X^n).$$

But it turns out that a stronger bound can be given if we have finer knowledge about the joint distribution of $W$ and $X^n$.

The following bound can be distilled from [Sam15]:

Theorem 20 (Samorodnitsky). Consider the Bayesian network

$$U \to X^n \to Y^n,$$

where $P_{Y^n|X^n} = \prod_{i=1}^n P_{Y_i|X_i}$ is a memoryless channel with $\eta_i = \eta_{\rm KL}(P_{Y_i|X_i})$. Then we have

$$I(U;Y^n) \le I(U;X_S|S) = I(U;X_S,S), \tag{72}$$

where $S \perp (U,X^n,Y^n)$ is a random subset of $[n]$ generated by independently sampling each element $i$ with probability $\eta_i$. In particular, if $\eta_i = \eta$ for all $i$, then

$$I(U;Y^n) \le \sum_{\sigma \subset [n]} \eta^{|\sigma|}(1-\eta)^{n-|\sigma|}\, I(U;X_\sigma). \tag{73}$$

Proof. Just put together the characterization (63) and the tensorization property (Proposition 16) to get $I(U;Y^n) \le I(U;E^n)$, where $E^n$ is the output of the product of erasure channels with erasure probabilities $1 - \eta_i$. Then the calculation (68) completes the proof. □
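As a small sanity check of (73), the sketch below compares $I(U;Y^n)$ with the subset-weighted bound for $U$ uniform over a binary codebook sent through $\mathrm{BSC}(\delta)^n$. It assumes the known value $\eta_{\rm KL}(\mathrm{BSC}(\delta)) = (1-2\delta)^2$; the codebooks and function names are illustrative and logarithms are base two.

```python
from itertools import combinations, product
from math import log2

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def projection_entropy(codebook, coords):
    """H(X_sigma) for X uniform on `codebook`, restricted to `coords`."""
    counts = {}
    for word in codebook:
        key = tuple(word[i] for i in coords)
        counts[key] = counts.get(key, 0) + 1
    return entropy(c / len(codebook) for c in counts.values())

def bsc_output_information(codebook, delta):
    """I(U;Y^n) = H(Y^n) - n h(delta), with U the index of a uniform codeword."""
    n = len(codebook[0])
    p_y = []
    for y in product((0, 1), repeat=n):
        prob = 0.0
        for x in codebook:
            flips = sum(a != b for a, b in zip(x, y))
            prob += delta**flips * (1 - delta) ** (n - flips) / len(codebook)
        p_y.append(prob)
    return entropy(p_y) - n * entropy((delta, 1 - delta))

def subset_bound(codebook, eta):
    """Right-hand side of (73), using I(U;X_sigma) = H(X_sigma)."""
    n = len(codebook[0])
    return sum(eta**size * (1 - eta) ** (n - size)
               * projection_entropy(codebook, coords)
               for size in range(n + 1)
               for coords in combinations(range(n), size))

if __name__ == "__main__":
    delta = 0.1
    eta = (1 - 2 * delta) ** 2
    parity = [w for w in product((0, 1), repeat=3) if sum(w) % 2 == 0]
    print(bsc_output_information(parity, delta), subset_bound(parity, eta))
```

For the parity-check codebook the subset-weighted bound is also smaller than $(1-(1-\eta)^n)\,I(U;X^n)$, which illustrates the gain from knowing how information is spread over subsets of coordinates.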

Remark 4. Let us say that the "total" information $I(U;X^n)$ is distributed among subsets of $[n]$ as given by the following numbers:

$$I_k \triangleq \binom{n}{k}^{-1} \sum_{T \in \binom{[n]}{k}} I(U;X_T).$$

Then the bound (73) says (replacing $\mathrm{Binom}(n,\eta)$ by its mean value $\eta n$):

$$I(U;Y^n) \lesssim I_{\eta n}.$$

Informally: the only kind of information about $U$ that has a chance to be inferred on the basis of $Y^n$ is one that is contained in subsets of $X^n$ of size at most $\eta n$.

Remark 5. Another implication of the Theorem is a strengthening of Mrs. Gerber's Lemma. Fix a single-letter channel $P_{Y|X}$ and suppose that for some increasing convex function $m(\cdot)$ and all random variables $X$ we have

$$H(Y) \ge m(H(X)).$$

Then, in the setting of the Theorem we have

$$H(Y^n) \ge n\, m\!\left(\frac{1}{\eta n} H(X_S|S)\right). \tag{74}$$

Note that by Han's inequality (74) is strictly better than the simple consequence of the chain rule: $H(Y^n) \ge n\,m(H(X^n)/n)$. For the case of $P_{Y|X} = \mathrm{BSC}(\delta)$ the bound (74) is a sharpening of Mrs. Gerber's Lemma, and has been the focus of [Sam15], see also [Ord16]. To prove (74) let $X^n \to E^n$ be $\mathrm{EC}(1-\eta)$. Then, by Theorem 20 applied to $U = X_i$, $n = i - 1$ we have

$$H(X_i|Y^{i-1}) \ge H(X_i|E^{i-1}).$$

Thus, from the chain rule and convexity of $m(\cdot)$ we obtain

$$H(Y^n) = \sum_i H(Y_i|Y^{i-1}) \ge n\, m\!\left(\frac{1}{n}\sum_i H(X_i|E^{i-1})\right)$$

and the proof is completed by computing $H(E^n)$ in two ways:

$$n h(\eta) + H(X_S|S) = H(E^n) = \sum_i H(E_i|E^{i-1}) = \sum_i \left[ h(\eta) + \eta\, H(X_i|E^{i-1}) \right].$$
i i 
Remark 6. Using Proposition 14 we may also state a divergence version of the Theorem: in the setting of Theorem 20, for any pair of distributions $P_{X^n}$ and $Q_{X^n}$ we have

$$D(P_{Y^n}\|Q_{Y^n}) \le D(P_{X_S|S}\|Q_{X_S|S}|P_S).$$

Similarly, we may extend the argument in the previous remark: if for fixed $Q_X$, $Q_Y$ (not necessarily related by $P_{Y|X}$) there exists an increasing concave function $f$ such that for all $P_X$ and $P_Y = P_{Y|X} \circ P_X$ we have

$$D(P_Y\|Q_Y) \le f(D(P_X\|Q_X)) \qquad \forall P_X,$$

then

$$D(P_{Y^n}\|Q_Y^n) \le n f\!\left(\frac{1}{\eta n} D\Big(P_{X_S|S}\Big\|\prod_{i \in S} Q_X \Big| P_S\Big)\right).$$
A Contraction coefficients on general spaces

A.1 Proof of Theorem 2

We show that

$$\eta_f(P_{Y|X},P_X) = \sup_{Q_X} \frac{D_f(Q_Y\|P_Y)}{D_f(Q_X\|P_X)} \ge \eta_{\chi^2}(P_{Y|X},P_X) = \sup_{Q_X} \frac{\chi^2(Q_Y\|P_Y)}{\chi^2(Q_X\|P_X)}, \tag{75}$$

where both suprema are over all $Q_X$ such that the respective denominator is in $(0,\infty)$. With the assumption that $P_X$ is not a point mass, namely, there exists a measurable set $E$ such that $P_X(E) \in (0,1)$, it is clear that such $Q_X$ always exists. For example, let $Q_X = \frac{1}{2}(P_X + P_{X|X\in E})$, where $P_{X|X\in E}(\cdot) \triangleq \frac{P_X(\cdot \cap E)}{P_X(E)}$. Then $\frac{1}{2} \le \frac{dQ_X}{dP_X} \le \frac{1}{2}\left(1 + \frac{1}{P_X(E)}\right)$ and hence $D_f(Q_X\|P_X) < \infty$ since $f$ is continuous. Furthermore, $Q_X \ne P_X$ implies that $D_f(Q_X\|P_X) \ne 0$ [Csi67].

The proof follows that of [CIR+93, Theorem 5.4] using the local quadratic behavior of $f$-divergence; however, in order to deal with general alphabets, additional approximation steps are needed to ensure the likelihood ratio is bounded away from zero and infinity.

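Fix $Q_X$ such that $\chi^2(Q_X\|P_X) < \infty$. Let $A = \{x : \frac{dQ_X}{dP_X}(x) < a\}$ where $a > 0$ is sufficiently large such that $Q_X(A) \ge 1/2$. Let $Q'_X = Q_{X|X\in A}$ and $Q'_Y = P_{Y|X} \circ Q'_X$. Let $Q''_X = \frac{1}{a}P_X + (1 - \frac{1}{a})Q'_X$ and $Q''_Y = P_{Y|X} \circ Q''_X = \frac{1}{a}P_Y + (1 - \frac{1}{a})Q'_Y$. Then we have

$$\frac{1}{a} \le \frac{dQ''_X}{dP_X} \le 2a + \frac{1}{a}, \qquad \frac{1}{a} \le \frac{dQ''_Y}{dP_Y} \le 2a + \frac{1}{a}. \tag{76}$$

Note that $\chi^2(Q'_X\|P_X) = \frac{1}{Q_X(A)^2}\,\mathbb{E}_{P_X}\!\left[\left(\frac{dQ_X}{dP_X}\right)^2 1\{X \in A\}\right] - 1$. By the dominated convergence theorem, $\chi^2(Q'_X\|P_X) \to \chi^2(Q_X\|P_X)$ as $a \to \infty$. On the other hand, since $Q'_Y \to Q_Y$ pointwise, the weak lower-semicontinuity of $\chi^2$-divergence yields $\liminf_{a\to\infty} \chi^2(Q'_Y\|P_Y) \ge \chi^2(Q_Y\|P_Y)$. Furthermore, using the simple fact that $\chi^2(\epsilon P + (1-\epsilon)Q\|P) = (1-\epsilon)^2\chi^2(Q\|P)$, we have

$$\frac{\chi^2(Q''_Y\|P_Y)}{\chi^2(Q''_X\|P_X)} = \frac{\chi^2(Q'_Y\|P_Y)}{\chi^2(Q'_X\|P_X)}.$$

Therefore, to prove (75), it suffices to show that for each fixed $a$ and any $\delta > 0$ there exists $\tilde{P}_X$ such that

$$\frac{D_f(\tilde{P}_Y\|P_Y)}{D_f(\tilde{P}_X\|P_X)} \ge \frac{\chi^2(Q''_Y\|P_Y)}{\chi^2(Q''_X\|P_X)} - \delta.$$

For $0 < \epsilon < 1$, let $\tilde{P}_X = \bar\epsilon P_X + \epsilon Q''_X$, which induces $\tilde{P}_Y = P_{Y|X} \circ \tilde{P}_X = \bar\epsilon P_Y + \epsilon Q''_Y$. Then $D_f(\tilde{P}_X\|P_X) = \mathbb{E}_{P_X}\!\left[f\!\left(1 + \epsilon\left(\frac{dQ''_X}{dP_X} - 1\right)\right)\right]$. Recall from (76) that $\frac{dQ''_X}{dP_X} \in [\frac{1}{a}, \frac{1}{a} + 2a]$. Since $f''$ is continuous and $f''(1) = 1$, by Taylor's theorem and the dominated convergence theorem, we have $D_f(\tilde{P}_X\|P_X) = \frac{\epsilon^2}{2}\chi^2(Q''_X\|P_X)(1 + o(1))$. Analogously, $D_f(\tilde{P}_Y\|P_Y) = \frac{\epsilon^2}{2}\chi^2(Q''_Y\|P_Y)(1 + o(1))$. This completes the proof of $\eta_f(P_X) \ge \eta_{\chi^2}(P_X)$.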

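Remark 7. In the special case of KL divergence, we can circumvent the step of approximating by bounded likelihood ratio: with $\tilde{P}_X = \bar\epsilon P_X + \epsilon Q_X$, by [PW15, Lemma 4.2], since $\chi^2(Q_Y\|P_Y) \le \chi^2(Q_X\|P_X) < \infty$, we have $D(\tilde{P}_X\|P_X) = \epsilon^2\chi^2(Q_X\|P_X)/2 + o(\epsilon^2)$ and $D(\tilde{P}_Y\|P_Y) = \epsilon^2\chi^2(Q_Y\|P_Y)/2 + o(\epsilon^2)$, as $\epsilon \to 0$. Therefore $\frac{\chi^2(Q_Y\|P_Y)}{\chi^2(Q_X\|P_X)} = \lim_{\epsilon\to 0}\frac{D(\tilde{P}_Y\|P_Y)}{D(\tilde{P}_X\|P_X)} \le \eta_{\rm KL}(P_X)$, and hence $\eta_{\rm KL}(P_X) \ge \eta_{\chi^2}(P_X)$.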

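A.2 Proof of Theorem 3

We prove

$$\eta_{\rm KL} = \eta_{\chi^2}. \tag{77}$$

First of all, $\eta_{\rm KL} \ge \eta_{\chi^2}$ follows from Theorem 2. For the other direction we closely follow the argument of [CRS94, Theorem 1]. Below we prove the following integral representation (with $D$ measured in nats):

$$D(Q\|P) = \int_0^\infty \chi^2(Q\|P^t)\,\frac{dt}{1+t}, \tag{78}$$

where $P^t \triangleq \frac{tQ + P}{1+t}$. Then

$$D(Q_Y\|P_Y) = \int_0^\infty \chi^2(Q_Y\|P_Y^t)\,\frac{dt}{1+t} \le \int_0^\infty \eta_{\chi^2}\cdot\chi^2(Q_X\|P_X^t)\,\frac{dt}{1+t} = \eta_{\chi^2}\, D(Q_X\|P_X),$$

where we used $P_Y^t = P_{Y|X} \circ P_X^t$. It remains to check (78). Note that

$$-\log x = \int_0^\infty \frac{1 - x}{(x+t)(1+t)}\,dt.$$

Therefore

$$D(Q\|P) = \int_0^\infty \frac{1}{1+t}\,\mathbb{E}_Q\!\left[\frac{dQ - dP}{dP + t\,dQ}\right] dt.$$

Note that $t\,\mathbb{E}_Q\!\left[\frac{dQ - dP}{dP + t\,dQ}\right] = -\mathbb{E}_P\!\left[\frac{dQ - dP}{dP + t\,dQ}\right]$. Therefore

$$\mathbb{E}_Q\!\left[\frac{dQ - dP}{dP + t\,dQ}\right] = \frac{1}{1+t}\int \frac{(dQ - dP)^2}{dP + t\,dQ} = \chi^2(Q\|P^t),$$

where the last equality uses $\int \frac{(dQ - dP)^2}{dP + t\,dQ} = (1+t)\,\chi^2(Q\|P^t)$, completing the proof of (78).

It is instructive to remark how this result was established for finite alphabets originally in [AG76]. Consider the map

$$P_X \mapsto V_r(P_X,Q_X) = D(P_{Y|X} \circ P_X\|P_{Y|X} \circ Q_X) - r D(P_X\|Q_X).$$

A simple differentiation shows that the Hessian of this map at $P_X$ is negative-definite if and only if $r > \eta_{\chi^2}(P_{Y|X},P_X)$ and negative semidefinite if and only if $r \ge \eta_{\chi^2}(P_{Y|X},P_X)$ (note that this does not depend on $Q_X$). Thus, taking $r = \eta_{\chi^2}(P_{Y|X})$, the map $P_X \mapsto V_r(P_X,Q_X)$ is concave in $P_X$ for all $Q_X$. Thus, its local extremum at $P_X = Q_X$ is a global maximum and hence $V_r(P_X,Q_X) \le 0$.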
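A.3 Proof of Theorem 4

We shall assume that $P_X$ is not a point mass, namely, there exists a measurable set $E$ such that $P_X(E) \in (0,1)$. Define

$$\eta_{\rm KL}(P_X) = \sup_{Q_X} \frac{D(Q_Y\|P_Y)}{D(Q_X\|P_X)},$$

where the supremum is over all $Q_X$ such that $0 < D(Q_X\|P_X) < \infty$. It is clear that such $Q_X$ always exists (e.g., $Q_X = P_{X|X\in E}$ and $D(Q_X\|P_X) = \log\frac{1}{P_X(E)} \in (0,\infty)$). Let

$$\eta_I(P_X) = \sup \frac{I(U;Y)}{I(U;X)},$$

where the supremum is over all Markov chains $U \to X \to Y$ with fixed $P_{XY}$ such that $0 < I(U;X) < \infty$. Such Markov chains always exist, e.g., $U = 1\{X \in E\}$ and then $I(U;X) = h(P_X(E)) \in (0,\log 2)$. The goal of this appendix is to prove (18), namely

$$\eta_{\rm KL}(P_X) = \eta_I(P_X).$$

The inequality $\eta_I(P_X) \le \eta_{\rm KL}(P_X)$ follows trivially:

$$I(U;Y) = D(P_{Y|U}\|P_Y|P_U) \le \eta_{\rm KL}(P_X)\, D(P_{X|U}\|P_X|P_U) = \eta_{\rm KL}(P_X)\, I(X;U).$$

For the other direction, fix $Q_X$ such that $0 < D(Q_X\|P_X) < \infty$. First, consider the case where $\frac{dQ_X}{dP_X}$ is bounded, namely, $\frac{dQ_X}{dP_X} \le a$ for some $a > 0$ $Q_X$-a.s. For any $\epsilon \le \frac{1}{2a}$, let $U \sim \mathrm{Bern}(\epsilon)$ and define the probability measure $\tilde{P}_X = \frac{P_X - \epsilon Q_X}{1 - \epsilon}$. Let $P_{X|U=0} = \tilde{P}_X$ and $P_{X|U=1} = Q_X$, which defines a Markov chain $U \to X \to Y$ such that $(X,Y)$ is distributed as the desired $P_{XY}$. Note that

$$\frac{I(U;Y)}{I(U;X)} = \frac{\bar\epsilon D(\tilde{P}_Y\|P_Y) + \epsilon D(Q_Y\|P_Y)}{\bar\epsilon D(\tilde{P}_X\|P_X) + \epsilon D(Q_X\|P_X)},$$

where $\bar\epsilon = 1 - \epsilon$ and $\tilde{P}_Y = P_{Y|X} \circ \tilde{P}_X$. We claim that

$$D(\tilde{P}_X\|P_X) = o(\epsilon), \tag{79}$$

which, in view of the data processing inequality $D(\tilde{P}_Y\|P_Y) \le D(\tilde{P}_X\|P_X)$, implies $\frac{I(U;Y)}{I(U;X)} \to \frac{D(Q_Y\|P_Y)}{D(Q_X\|P_X)}$ as $\epsilon \to 0$, as desired. To establish (79), define the function

$$f(x,\epsilon) = \begin{cases} \frac{1 - \epsilon x}{\epsilon(1-\epsilon)} \log\frac{1 - \epsilon x}{1 - \epsilon}, & \epsilon > 0 \\ (1 - x)\log e, & \epsilon = 0. \end{cases}$$

One easily notices that $f$ is continuous on $[0,a] \times [0,\frac{1}{2a}]$ and thus bounded. So we get, by the bounded convergence theorem,

$$\frac{1}{\epsilon} D(\tilde{P}_X\|P_X) = \mathbb{E}_{P_X}\!\left[f\!\left(\frac{dQ_X}{dP_X}, \epsilon\right)\right] \to \mathbb{E}_{P_X}\!\left[1 - \frac{dQ_X}{dP_X}\right]\log e = 0.$$

To drop the boundedness assumption on $\frac{dQ_X}{dP_X}$ we simply consider the conditional distribution $Q'_X \triangleq Q_{X|X\in A}$, where $A = \{x : \frac{dQ_X}{dP_X}(x) < a\}$ and $a > 0$ is sufficiently large so that $Q_X(A) > 0$. Clearly, as $a \to \infty$, we have $Q'_X \to Q_X$ and $Q'_Y \to Q_Y$ pointwise (i.e. $Q'_Y(E) \to Q_Y(E)$ for every measurable set $E$), where $Q'_Y \triangleq P_{Y|X} \circ Q'_X$. Hence the lower-semicontinuity of divergence yields

$$\liminf_{a\to\infty} D(Q'_Y\|P_Y) \ge D(Q_Y\|P_Y).$$

Furthermore, since $\frac{dQ'_X}{dP_X} = \frac{1}{Q_X(A)}\frac{dQ_X}{dP_X} 1_A$, we have

$$D(Q'_X\|P_X) = \log\frac{1}{Q_X(A)} + \frac{1}{Q_X(A)}\,\mathbb{E}_Q\!\left[\log\frac{dQ_X}{dP_X}\, 1\!\left\{\frac{dQ_X}{dP_X} < a\right\}\right]. \tag{80}$$

Since $Q_X(A) \to 1$, by dominated convergence (note: $\mathbb{E}_Q[|\log\frac{dQ_X}{dP_X}|] < \infty$) we have $D(Q'_X\|P_X) \to D(Q_X\|P_X)$. Therefore,

$$\liminf_{a\to\infty} \frac{D(Q'_Y\|P_Y)}{D(Q'_X\|P_X)} \ge \frac{D(Q_Y\|P_Y)}{D(Q_X\|P_X)},$$

completing the proof.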


A.4 Proof of Proposition 14 

First, notice the following simple result:

$$D(Q\|\lambda P + \bar\lambda Q) = o(\lambda),\ \lambda \to 0 \iff P \ll Q. \tag{81}$$

Indeed, if $P \not\ll Q$ then there is a set $E$ with $p = P[E] > 0 = Q[E]$. Denote the binary divergence by $d(p\|q) \triangleq D(\mathrm{Bern}(p)\|\mathrm{Bern}(q))$. Applying data-processing for divergence to $x \mapsto 1_E(x)$, we get

$$D(Q\|\lambda P + \bar\lambda Q) \ge d(0\|\lambda p) = \log\frac{1}{1 - \lambda p}$$

and the derivative at $\lambda \to 0$ is non-zero. If $P \ll Q$, then let $f = \frac{dP}{dQ}$ and notice

$$\log\bar\lambda \le \log(\bar\lambda + \lambda f) \le \lambda(f - 1)\log e.$$

Dividing by $\lambda$ and assuming $\lambda < \frac{1}{2}$ we get

$$\left|\frac{1}{\lambda}\log(\bar\lambda + \lambda f)\right| \le C_1 f + C_2,$$

for some absolute constants $C_1, C_2$. Thus, by the dominated convergence theorem we get

$$\frac{1}{\lambda} D(Q\|\lambda P + \bar\lambda Q) = -\int dQ\,\frac{1}{\lambda}\log(\bar\lambda + \lambda f) \to \int dQ\,(1 - f)\log e = 0.$$

Another observation is that

$$\lim_{\lambda\to 0} D(P\|\lambda P + \bar\lambda Q) = D(P\|Q), \tag{82}$$

regardless of the finiteness of the right-hand side (this is a property of all convex lower-semicontinuous functions).

Now, we prove Proposition 14. One direction is easy: if $D(Q_Y\|P_Y) \le D(Q_{Y'}\|P_{Y'})$ then

$$I(W;Y) = D(P_{Y|W}\|P_Y|P_W) \le D(P_{Y'|W}\|P_{Y'}|P_W) = I(W;Y').$$

For the other direction, consider an arbitrary pair $(P_X, Q_X)$. Let $W \sim \mathrm{Bern}(\epsilon)$ and $P_{X|W=0} = P_X$, $P_{X|W=1} = Q_X$. Then, we get

$$I(W;Y) = \bar\epsilon D(P_Y\|\bar\epsilon P_Y + \epsilon Q_Y) + \epsilon D(Q_Y\|\bar\epsilon P_Y + \epsilon Q_Y),$$

and similarly for $I(W;Y')$. Assume that $D(Q_{Y'}\|P_{Y'}) < \infty$, for otherwise (62) holds trivially. Then $Q_{Y'} \ll P_{Y'}$ and we get from (81) and (82) that

$$I(W;Y') = \epsilon D(Q_{Y'}\|P_{Y'}) + o(\epsilon). \tag{83}$$

On the other hand, again from (82)

$$I(W;Y) \ge \epsilon D(Q_Y\|\bar\epsilon P_Y + \epsilon Q_Y) = \epsilon D(Q_Y\|P_Y) + o(\epsilon). \tag{84}$$

Since by assumption $I(W;Y) \le I(W;Y')$, we conclude from comparing (83) to (84) that $D(Q_Y\|P_Y) \le D(Q_{Y'}\|P_{Y'}) < \infty$, completing the proof.


B Contraction coefficients for binary-input channels 


In this appendix we provide a tight characterization of the KL contraction coefficient for a binary-input channel $P_{Y|X}$, where $X \in \{0,1\}$ and $Y$ is arbitrary. Clearly, $\eta_{\rm KL}(P_{Y|X})$ is a function of $P \triangleq P_{Y|X=0}$ and $Q \triangleq P_{Y|X=1}$, which we abbreviate as $\eta(\{P,Q\})$. The behavior of this quantity closely resembles that of divergence between distributions. Indeed, we expect $\eta(\{P,Q\})$ to be bigger if $P$ and $Q$ are more dissimilar and, furthermore, $\eta(\{P,Q\}) = 0$ (resp. 1) if and only if $P = Q$ (resp. $P \perp Q$). Next we show that $\eta(\{P,Q\})$ is essentially equivalent to Hellinger distance:

Theorem 21. Consider a binary input channel $P_{Y|X} : \{0,1\} \to \mathcal{Y}$ with $P_{Y|X=0} = P$ and $P_{Y|X=1} = Q$. Then, its contraction coefficient $\eta_{\rm KL}(P_{Y|X}) = \eta_{\chi^2}(P_{Y|X}) \triangleq \eta(\{P,Q\})$ satisfies

$$\frac{H^2(P,Q)}{2} \le \eta(\{P,Q\}) \le H^2(P,Q) - \frac{H^4(P,Q)}{4}, \tag{85}$$

where Hellinger distance is defined as $H^2(P,Q) \triangleq 2 - 2\int\sqrt{dP\,dQ}$.

Remark 8. An obvious upper bound is $\eta(\{P,Q\}) \le d_{\rm TV}(P,Q)$ by Theorem 1, which is worse than Theorem 21 since $d_{\rm TV}$ is smaller than the square-root of the right-hand side of (85). In fact it is straightforward to verify that the upper bound holds with equality when the output $Y$ is also binary-valued. In particular, Theorem 21 implies that $\eta(\{P,Q\})$ is always within a factor of two of $H^2(P,Q)$.


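Proof. First notice the identities:

$$\chi^2(\mathrm{Bern}(\alpha)\|\mathrm{Bern}(\beta)) = \frac{(\alpha - \beta)^2}{\beta\bar\beta}, \qquad \chi^2(\alpha P + \bar\alpha Q\|\beta P + \bar\beta Q) = (\alpha - \beta)^2 \int \frac{(P - Q)^2}{\beta P + \bar\beta Q},$$

where we denote $\bar\alpha = 1 - \alpha$. Therefore the (input-dependent) $\chi^2$-contraction coefficient is given by

$$\eta_{\chi^2}(\mathrm{Bern}(\beta), P_{Y|X}) = \sup_{\alpha} \frac{\chi^2(\alpha P + \bar\alpha Q\|\beta P + \bar\beta Q)}{\chi^2(\mathrm{Bern}(\alpha)\|\mathrm{Bern}(\beta))} = \beta\bar\beta \int \frac{(P - Q)^2}{\beta P + \bar\beta Q} \triangleq \mathrm{LC}_\beta(P\|Q),$$

where $\mathrm{LC}_\beta(P\|Q)$, clearly an $f$-divergence, is known as the Le Cam divergence (see, e.g., [Vaj09, p. 889]). In view of Theorem 3, the input-independent KL-contraction coefficient coincides with that of $\chi^2$ and hence

$$\eta(\{P,Q\}) = \sup_{\beta \in (0,1)} \mathrm{LC}_\beta(P\|Q).$$

Thus the desired bound (85) follows from the characterization of the joint range between pairs of $f$-divergences [HV11], namely, $H^2$ versus $\mathrm{LC}_\beta$, by taking the convex hull of their joint range restricted to Bernoulli distributions. Instead of invoking this general result, next we prove (85) using elementary arguments. Since $\mathrm{LC}_{1/2}(P\|Q) = 1 - 2\int\frac{dP\,dQ}{dP + dQ} \ge 1 - \int\sqrt{dP\,dQ} = \frac{1}{2}H^2(P,Q)$, the left inequality of (85) follows immediately. To prove the right inequality, by Cauchy-Schwarz, note that we have

$$\left(1 - \tfrac{1}{2}H^2(P,Q)\right)^2 = \left(\int\sqrt{dP\,dQ}\right)^2 = \left(\int\sqrt{\frac{dP\,dQ}{\beta\,dP + \bar\beta\,dQ}}\,\sqrt{\beta\,dP + \bar\beta\,dQ}\right)^2 \le \int\frac{dP\,dQ}{\beta\,dP + \bar\beta\,dQ} = 1 - \mathrm{LC}_\beta(P\|Q),$$

for any $\beta \in (0,1)$. □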
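The two-sided bound (85) can be checked numerically for finite output alphabets. The sketch below approximates $\sup_\beta \mathrm{LC}_\beta(P\|Q)$ by a grid search over $\beta$, which is a simplification assumed for this example rather than the method of the proof; the distributions and function names are likewise illustrative.

```python
from math import sqrt

def le_cam(P, Q, beta):
    """LC_beta(P||Q) = beta (1-beta) sum (p-q)^2 / (beta p + (1-beta) q)."""
    b_bar = 1 - beta
    return beta * b_bar * sum((p - q) ** 2 / (beta * p + b_bar * q)
                              for p, q in zip(P, Q)
                              if beta * p + b_bar * q > 0)

def hellinger_sq(P, Q):
    """H^2(P,Q) = 2 - 2 sum sqrt(p q)."""
    return 2 - 2 * sum(sqrt(p * q) for p, q in zip(P, Q))

def contraction_bounds(P, Q, grid=2000):
    """Return (H^2/2, grid estimate of eta({P,Q}), H^2 - H^4/4)."""
    h2 = hellinger_sq(P, Q)
    eta = max(le_cam(P, Q, k / grid) for k in range(1, grid))
    return h2 / 2, eta, h2 - h2**2 / 4

if __name__ == "__main__":
    print(contraction_bounds((0.7, 0.2, 0.1), (0.1, 0.3, 0.6)))
    print(contraction_bounds((0.9, 0.1), (0.2, 0.8)))
```

In the second call the output alphabet is binary, so the estimate should essentially coincide with the upper bound, in line with Remark 8.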


C Simultaneously maximal couplings 


Lemma 22. Let $\mathcal{X}$ and $\mathcal{Y}$ be Polish spaces. Given any pair of Borel probability measures $P_{XY}, Q_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, there exists a coupling $\pi$ of $P_{XY}$ and $Q_{XY}$, namely, a joint distribution of $(X,Y,X',Y')$ such that $\mathcal{L}(X,Y) = P_{XY}$ and $\mathcal{L}(X',Y') = Q_{XY}$ under $\pi$, such that

$$\pi\{(X,Y) \ne (X',Y')\} = d_{\rm TV}(P_{XY},Q_{XY}) \quad\text{and}\quad \pi\{X \ne X'\} = d_{\rm TV}(P_X,Q_X). \tag{86}$$

Remark 9. After submitting this manuscript, we were informed that this result is the main content of [Gol79]. For the interested reader we keep our original proof, which differs from [Gol79] by relying on Kantorovich's dual representation and, thus, is non-constructive.

Remark 10. A triply-optimal coupling achieving, in addition to (86), also $\pi[Y \ne Y'] = d_{\rm TV}(P_Y,Q_Y)$ need not exist. Indeed, consider the example where $X,Y$ are $\{0,1\}$-valued and

$$P_{XY} = \begin{pmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{pmatrix}, \qquad Q_{XY} = \begin{pmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{pmatrix}.$$

In other words, $X, Y \sim \mathrm{Bern}(1/2)$ under both $P$ and $Q$; however, $X = Y$ under $P$ and $X = 1 - Y$ under $Q$. Furthermore, since $d_{\rm TV}(P_X,Q_X) = d_{\rm TV}(P_Y,Q_Y) = 0$, under any coupling $\pi_{XYX'Y'}$ of $P_{XY}$ and $Q_{XY}$ that simultaneously couples $P_X$ to $Q_X$ and $P_Y$ to $Q_Y$ maximally, we have $X = X'$ and $Y = Y'$, which contradicts $X = Y$ and $X' = 1 - Y'$. On the other hand, it is clear that a doubly-optimal coupling (as claimed by Lemma 22) exists: just take $X = X' = Y \sim \mathrm{Bern}(1/2)$ and $Y' = 1 - X'$. It is not hard to show that such a coupling also attains the minimum

$$\min_\pi\ \pi[(X,Y) \ne (X',Y')] + \pi[X \ne X'] + \pi[Y \ne Y'] = 2.$$

TT 


Proof. Define the cost function $c(x,y,x',y') \triangleq 1\{(x,y) \ne (x',y')\} + 1\{x \ne x'\}$. Since the indicator of any open set is lower semicontinuous, so is $(x,y,x',y') \mapsto c(x,y,x',y')$. Applying Kantorovich's duality theorem (see, e.g., [Vil03, Theorem 1.3]), we have

$$\min_{\pi \in \Pi(P_{XY},Q_{XY})} \mathbb{E}_\pi\, c(X,Y,X',Y') = \max_{f,g}\ \mathbb{E}_P[f(X,Y)] - \mathbb{E}_Q[g(X,Y)], \tag{87}$$

where $f \in L_1(P)$, $g \in L_1(Q)$ and

$$f(x,y) - g(x',y') \le c(x,y,x',y'). \tag{88}$$

Since the cost function is bounded, namely, $c$ takes values in $[0,2]$, applying [Vil03, Remark 1.3], we conclude that it suffices to consider $0 \le f, g \le 2$. Note that constraint (88) is equivalent to

$$f(x,y) - g(x',y') \le 2, \quad \forall x \ne x',\ \forall y, y'$$
$$f(x,y) - g(x,y') \le 1, \quad \forall x,\ \forall y \ne y'$$
$$f(x,y) - g(x,y) \le 0, \quad \forall x,\ \forall y$$

where the first condition is redundant given the range of $f, g$. In summary, the maximum on the right-hand side of (87) can be taken over all $f, g$ satisfying the following constraints:

$$0 \le f, g \le 2$$
$$f(x,y) - g(x,y') \le 1, \quad \forall x,\ y \ne y'$$
$$f(x,y) - g(x,y) \le 0, \quad \forall x, y.$$

Then

$$\max_{f,g}\ \mathbb{E}_P[f(X,Y)] - \mathbb{E}_Q[g(X,Y)] = \int_{\mathcal{X}} \max_{\phi,\psi}\left\{\int_{\mathcal{Y}} p(x,y)\phi(y) - q(x,y)\psi(y)\right\} \tag{89}$$

where the maximum on the right-hand side is over $\phi, \psi : \mathcal{Y} \to \mathbb{R}$ satisfying

$$\phi(y) - \psi(y') \le 1, \quad \forall y \ne y' \tag{90}$$
$$\phi(y) - \psi(y) \le 0, \quad \forall y.$$

The optimization problem in the bracket on the right-hand side of (89) can be solved using the following lemma:

Lemma 23. Let $p, q \ge 0$. Let $(x)_+ \triangleq \max\{x, 0\}$. Then

$$\max_{\phi,\psi}\left\{\int_{\mathcal{Y}} p\phi - q\psi : 0 \le \phi \le \psi \le 2,\ \sup\phi \le 1 + \inf\psi\right\} = \int (p - q)_+ + \left(\int (p - q)\right)_+. \tag{91}$$

Proof. First we show that it suffices to consider $\phi = \psi$. Given any feasible pair $(\phi,\psi)$, set $\phi' = \max\{\phi, \inf\psi\}$. To check that $(\phi',\phi')$ is a feasible pair, note that clearly $\phi'$ takes values in $[0,2]$. Furthermore, $\sup\phi' \le \sup\phi \le 1 + \inf\psi \le 1 + \inf\phi'$. Therefore the maximum on the left-hand side of (91) is equal to

$$\max_\phi\left\{\int_{\mathcal{Y}} (p - q)\phi : 0 \le \phi \le 2,\ \sup\phi \le 1 + \inf\phi\right\}.$$

Let $a = \inf\phi$ and write $(x)_- \triangleq \min\{x, 0\}$. Then

$$\max_\phi\left\{\int (p - q)\phi : 0 \le \phi \le 2,\ \sup\phi \le 1 + \inf\phi\right\} = \sup_{0 \le a \le 2}\max_\phi\left\{\int (p - q)\phi : a \le \phi \le 2 \wedge (1 + a)\right\}$$

$$= \sup_{0 \le a \le 1}\max_\phi\left\{\int (p - q)\phi : a \le \phi \le 1 + a\right\}$$

$$= \sup_{0 \le a \le 1}\left\{(1 + a)\int (p - q)_+ + a\int (p - q)_-\right\}$$

$$= \sup_{0 \le a \le 1}\left\{\int (p - q)_+ + a\int (p - q)\right\}$$

$$= \int (p - q)_+ + \left(\int (p - q)\right)_+.$$

□

Applying Lemma 23 to (89) for fixed $x$, we have

$$\max_{f,g}\ \mathbb{E}_P[f(X,Y)] - \mathbb{E}_Q[g(X,Y)] = \int_{\mathcal{X}}\left\{\int_{\mathcal{Y}} (p(x,y) - q(x,y))_+ + (p(x) - q(x))_+\right\}$$

$$= \int_{\mathcal{X}}\int_{\mathcal{Y}} (p(x,y) - q(x,y))_+ + \int_{\mathcal{X}} (p(x) - q(x))_+$$

$$= d_{\rm TV}(P_{XY},Q_{XY}) + d_{\rm TV}(P_X,Q_X).$$

Combining the above with (87), we have

$$\min_{\pi_{XYX'Y'}}\ \pi\{(X,Y) \ne (X',Y')\} + \pi\{X \ne X'\} = d_{\rm TV}(P_{XY},Q_{XY}) + d_{\rm TV}(P_X,Q_X).$$

Since $\pi\{(X,Y) \ne (X',Y')\} \ge d_{\rm TV}(P_{XY},Q_{XY})$ and $\pi\{X \ne X'\} \ge d_{\rm TV}(P_X,Q_X)$ for any $\pi$, the minimizer of the sum on the left-hand side achieves equality simultaneously for both terms, proving the lemma. □